Project SmartFi
Exploring AI/ML for FinTech News
IN COLLABORATION WITH SYNTASA, POWERED BY GOOGLE CLOUD

ABSTRACT
The World Bank Finance and Technology Department, in collaboration with The World Bank Technology and Innovation Lab, partnered with Google Cloud and Syntasa Inc. to learn how artificial intelligence and machine learning could enhance news sourcing and sentiment analysis for FinTech topics globally. This outcome report shares the key learnings and insights gained from the exploration and development of a prototype.

ACKNOWLEDGEMENTS
The key learnings outlined in this report were prepared by the Project SmartFi (Smart Finance) team.
World Bank Treasury Finance and Technology (TREFT): Paul Snaith, Patrick Cheng, Jaskaran Singh
World Bank Technology and Innovation Lab (ITSTI): Yusuf Karacaoglu, Stela Mocan, Mora Farhad, Mahesh Chandrahas Karajgi, Oleksandra Postavnicha, Yujuan Sun
World Bank Corporate Procurement: Sanjay Colaco, Shweta Mesipam
Syntasa Incorporated: Shawn Zargham, Michael Finn, Kyle Witt, James Wilson, Eric Bugin, Kareem Sharaf, Ted Blake
Google Cloud: Ryan Wright, Rajat Gupta

Contents
Abbreviations and Acronyms
Section 1: Overview
    Executive Summary
    Project Background
    Project Team & Sponsor
Section 2: Exploration with Artificial Intelligence for Financial News
    Research Approach
    Business Challenge Scope
Section 3: Collaboration with Google Cloud and Syntasa
    Rapid Prototyping with Technology Partners
    Solution Overview and Key Results
    Technical Approach (Syntasa)
Section 4: Learning Outcomes and Future Considerations
    Technical Learnings for World Bank
    Business Learnings and Outcome
Appendix A: Narrative Dashboard Features
Appendix B: Reference Data
Appendix C: Brandwatch
Appendix D: SmartFi – Trusted Domains Technical Details
Appendix E: SmartFi – Uncertain Domains Technical Details
Appendix F: SmartFi – Chinese Language Technical Details

FIGURES AND TABLES
Table 2.1
Figure 3.1: Syntasa Solution
Figure 3.2: Modeled Mentions
Figure 3.3: Word Cloud
Figure 3.4: Domain Source
Figure 3.5: Domain and PDF Sourcing
Figure 3.6: Trending Topics
Figure 3.7: Sentiment Validation
Figure 3.8: Sentiment Model Explainability
Figure 3.9: Solution Architecture
Figure 3.10: Data and AI pipeline
Figure 3.11: Chinese Language App Configuration
Figure 3.12: Topic Modeling Parameters
Figure 3.13: Dashboard Trending Phrases
Figure 3.14: Sentiment Explainability
Figure 3.15: Sentiment Validation
Figure 3.16: Language Translation Performance
Figure 3.17: PDF Sourcing
Figure 3.18: Sentiment Explainability
Figure 3.19: Solution Architecture
Figure 4.1: Topic Modeling
Figure 4.2: Topic Modeling Explainer
Table 4.1: Sentiment Analysis Models

Abbreviations and Acronyms
Abbreviation    Description
AI              Artificial Intelligence
API             Application Programming Interface
App             Application
AWS             Amazon Web Services
BARD AI         Google’s Generative AI Tool
BERT            Bidirectional Encoder Representations from Transformers
BI              Business Intelligence
BQ              BigQuery
ChatGPT         OpenAI’s Generative AI Tool
DLP             Data Loss Prevention
ETL             Extract, Transform, Load
FedRAMP         Federal Risk and Authorization Management Program
FinTech         Finance and Technology
FTX             Futures Exchange
GCP             Google Cloud Platform
IAM             Identity and Access Management
IoT             Internet of Things
ITSTI           World Bank Group Technology and Innovation Lab
JSON            JavaScript Object Notation
KPI             Key Performance Indicators
LDA             Latent Dirichlet Allocation
LLM             Large Language Models
LookML          Looker Modeling Language
ML              Machine Learning
NLP             Natural Language Processing
NMF             Non-negative Matrix Factorization
OCR             Optical Character Recognition
POC             Proof of Concept
PoV             Proof of Value
RoBERTa         Variant of BERT model
RPA             Robotic Process Automation
SaaS            Software as a Service
SmartFi         Smart Finance
SME             Subject Matter Expert
TI Lab          World Bank Technology and Innovation Lab
TRE             Treasury
TREFT           World Bank Treasury Financial Technology unit
UI              User Interface
VPC             Virtual Private Cloud

SECTION 1 OVERVIEW

Executive Summary
In today’s fast-paced world, it can be challenging to stay informed on the latest financial technology news and trends, which can help to inform decisions for financial and operational strategies. The amount of information and opinions available on the internet can be overwhelming, and it can be challenging to filter out what is most relevant and important for business users. Technology is constantly evolving; new trends and developments may emerge daily.

To address this challenge, the World Bank Treasury Financial Technology unit (TREFT) and the World Bank Group Technology and Innovation Lab (ITSTI) (hereafter “project team”) worked on a framing exercise to explore how emerging technologies could give users access to curated, trusted, and relevant news sources that inform them of sentiments across trending topics. The ITSTI lab follows a structured approach using design thinking methodologies to understand the needs, wants, and pain-points of end users.

The project team identified a sample list of the key topics and terms of interest; various trusted sources (including open source and subscription content, and social media channels); and the geographic areas of interest, to help guide the data requirements. The team also conducted market research to understand how similar problems are being solved, and to build on the in-lab knowledge.

Throughout this research, we worked with the largest search provider, Google Cloud. The Google Cloud Platform (GCP) provides a range of tools and services that are helpful in using machine learning to source news, for example the Cloud Natural Language API, which extracts entities, sentiments, and insights from news articles, among many other capabilities.
We also worked with Google Cloud’s partner company, Syntasa Inc., which specializes in sentiment analytics, generating insights through data analytics, and understanding digital behaviors to customize solutions for business users.

With Syntasa, which is powered by Google Cloud, we collaborated on designing and creating a prototype of a dashboard that provides users with the ability to gain insights into sentiment trends so that behavior shifts can be quickly identified by topic and by region. The visualization tool we created also provides flexibility in customizing filters, to enable quick access to digestible FinTech topics that can help users stay up to date with the latest trends and developments in their industries; identify new opportunities; and make informed decisions.

Our collaboration provided the project team with the opportunity to not only explore potential solutions but also to learn from Syntasa how private technology firms blueprint and develop artificial intelligence (AI) and machine learning (ML) prototypes to scale into enterprise adoption. The World Bank Technology and Innovation Lab (TI Lab) technical team worked closely with Syntasa and Google Cloud to learn how data scientists build custom AI/ML models, and test them for accuracy and explainability regarding transparency, accountability, and compliance, and to ensure that AI systems are fair, ethical, and safe to use. This report outlines the technical learnings, value drivers, and capabilities of the solution we developed.

Project Background
The World Bank’s Treasury Operations, Financial Technology unit (TREFT) helps lead the treasury’s technological advancement initiatives from the ideation phase through development, and successful implementation in close partnership with the treasury business units and technology developers.
TREFT actively engages with the Bank’s business units on identifying and implementing suitable technical solutions for business use cases in treasury operations, and their potential development and implementation through in-house and/or off-the-shelf solutions. Such a process requires a constant review of the Bank’s internal technology capabilities and comparison with existing industry standards and new market developments. Consequently, it is immensely important for TREFT to selectively monitor new technology trends and solutions, and subsequently to determine their suitability for the improvement of treasury operations.

Currently, this process is being largely performed manually, with a considerable amount of personnel time and resources being dedicated to it on a regular basis. Some of the current challenges include:
• Manual sourcing and consolidation of the most relevant and informative FinTech news and events is tedious.
• Keeping track of market discussions and public sentiment surrounding notable FinTech topics and events.
• Limited search scope in terms of news sources, given the time and resource constraints.
• Determining the authenticity of a news source, its thematic relevance, and potential topical categorization.

In order to tackle these challenges and to systematically harmonize the process of FinTech and technology news sourcing, TREFT sees a unique opportunity to explore an AI system that mimics human methods in order to quickly and efficiently source curated news relevant to the topics of interest for a specific business unit. A related opportunity comes with automating the process of quantifying relevance, measuring sentiment, and determining the bias of news after it has been sourced. This can be accomplished by mirroring human tactics for measuring how relevant an article is, and determining its overall sentiment and bias, a process which can also be supported through AI methods.
Given the existence of these opportunities and the potential benefits of deploying such an AI solution to multiple use cases within treasury, TREFT, along with its partner, the TI Lab, collaborated in exploring in-house and off-the-shelf solutions which could fulfill the requirements of the use case.

Project Team & Sponsor
TREFT coordinates the efficient internal administration of the World Bank Treasury’s Information Technology infrastructure across all institutional projects, maintenance, and budget and planning cycles, ensuring that it remains fit for purpose, up-to-date, secure, and reliable. The unit also develops and maintains appropriate strategic technology planning in relation to Treasury’s significant standing in the global financial markets, and leverages that standing to build internal and external partnerships for market and development effect. TREFT’s technology initiatives include leading Treasury’s participation in large-scale system renewals and emerging technology projects in FinTech fields such as AI/ML, blockchain, RPA, and World Bank finance-wide projects.

The TI Lab is a specialized unit within the World Bank Group’s Information and Technology vice presidency, centered around three main pillars: innovation, experimentation, and capacity building. TI Lab works closely with various departments and units within the World Bank Group, as well as with external partners, to identify potential areas where emerging technologies can be applied to solve business and development problems. It aims to assist World Bank Group (WBG) business teams in problem framing, requirement gathering, data preparation, technical guidance, and prototype delivery to help decision makers assess whether an investment is worth embarking on for operationalization. The mandate in the TI Lab is to learn by doing and to share knowledge across teams, for continuous innovation.
SECTION 2 EXPLORATION WITH ARTIFICIAL INTELLIGENCE FOR FINANCIAL NEWS

Research Approach
1. What are the most effective methods for collecting and curating news articles related to a specific topic or set of topics?
2. How accurate and reliable are existing sentiment analysis models for analyzing news articles, and what types of customizations or training are needed to improve their performance?
3. How do different sources of news articles (social media, traditional news outlets, blogs) vary in terms of their sentiment and relevance to specific topics?
4. What are the most effective methods for visualizing and presenting sentiment analysis results to users, and how can these be customized to meet the needs of different stakeholders?
5. How can sentiment analysis be used to identify trends and emerging topics in a specific industry or field, and what types of insights can be gained from this analysis?
6. What are the ethical and legal implications of using sentiment analysis to curate and analyze news articles, and how can these be addressed in the development and implementation of the solution?
7. How do different user groups (analysts, executives, investors) use curated news and sentiment analysis, and what other features and functionalities can be important to these users?

Business Challenge Scope
The scope of the PoC was determined by the project team in collaboration with Syntasa. Foundational data and base material were provided as inputs to the Syntasa team as detailed below:

Relevant topics of interest to TREFT business operations were provided to Syntasa in the form of a holistic Excel document with the following structure. Major themes were developed, and various subtopics were categorized into the themes, which then formed the pool of relevant FinTech and technology-related keywords.
To provide additional filter mechanisms and take into account the geographical relevance of the topics, an additional list of geographic locations and regions was provided, which, combined with the theme subtopics, yields more specific and relevant search results. A brief example of the structure of the inputs can be seen in Table 2.1, and a detailed overview is provided in Appendix B.

TABLE 2.1
Themes: Asset Tokenization; Digital Currency; Web3
Keywords:
• Fungible tokens
• CBDC (Central Bank Digital Currency)
• Blockchain
• ICO (Initial Coin Offering)
• Cryptocurrency
• Delivery versus Payment (DvP)
• NFT (Non-Fungible Tokens)
• DApps (Decentralized Apps)
• Programmable Money
• DLT (Distributed Ledger Technology)
• Digital Assets
• Programmable Payments
• Digital Wallet
• Decentralized Autonomous Organizations (DAOs)
• Carbon tokenization
• Stablecoin
• Security Tokens Offering (STO)
• FOMO (Fear of Missing Out)
• Decentralized Finance (DeFi)
• Instant Payment
• Interoperability
Filters:
• List of Regions (North America, South America, Europe, MENA, Asia, etc.)
• List of Domains (federalreserve.gov, ecb.europa.eu, bankofcanada.ca, mas.gov.sg, imf.org, etc.)

Value Proposition
The following are value-drivers for the proposed solution:
• Stay informed on industry trends and news: Allows users to stay up-to-date on the latest news and developments in the finance and technology industries, including emerging trends and topics.
• Gain insights into sentiment trends: Allows users to quickly identify shifts in sentiment towards specific topics or companies, providing valuable insights into market trends and sentiment.
• Monitor Partners: Users could track news and sentiment around member countries, NGOs, commercial banks, and other partners, enabling them to stay informed on their actions and strategies.
• Make data-driven decisions: Accurate and reliable sentiment analysis on desired topics helps users make data-driven decisions based on real-time insights.
• Save time and resources: Users can save time and resources that would otherwise be spent searching for and analyzing news articles manually.

Capabilities that could be included in the dashboard to support these value drivers include:
• Customizable news feeds: Users could customize their news feeds to only show news articles related to specific topics or keywords, ensuring that they only see relevant content.
• Sentiment analysis: Flexibility to filter by sentiment on specified topics or across the geographic landscape to understand how different regions or industries react to FinTech.
• Real-time updates: Users may adjust the time horizon to understand how topics in FinTech have evolved over time, or receive alerts in real time.
• Customizable alerts: Users could set up alerts to notify them of changes in sentiment or news related to specific topics or companies, enabling them to stay informed without constantly monitoring the dashboard.
• Integration with other tools: The dashboard could be integrated with other tools, such as trading platforms or financial analysis tools, allowing users to make data-driven decisions directly from the dashboard. Generative AI could possibly be integrated in the future.

By incorporating these value drivers and capabilities, a dashboard that shows finance and technology-related news with sentiment analysis could provide valuable insights and result in time savings for its users.

SECTION 3 COLLABORATION WITH GOOGLE CLOUD AND SYNTASA

Rapid Prototyping with Technology Partners
Add content on the motivation to learn from the Google Cloud Platform (GCP) platform, and on designing a prototype solution with a technology partner.
About Syntasa
Syntasa is a cloud-based data and AI platform that enables users to connect various data sources, build and deploy customized AI/ML models, and activate them across various channels through dashboards, data shares, and APIs. This tool provides users with visibility into the full data pipeline, including data source, dependencies, and how the data is being used to drive insights. The Syntasa platform is built with leading open-source technologies, and is powered by GCP services.

The Syntasa platform uses the concept of apps (along with the sequencing of those apps) to accelerate time-to-value; improve reliability and efficiency; and provide significant return on investment over home-grown cloud-based solutions. The apps provide capabilities ranging from low-code or no-code to full-code, which allows business users, analysts, data scientists, and data engineers to collaborate, and to leverage and share their expertise.

The Syntasa platform runs natively in an organization’s GCP with the data stored in Google Cloud Storage and BigQuery. Organizations can keep their sensitive data inside their virtual private cloud (VPC) and behind their firewall, thus maintaining full control, while leveraging the power of advances in big data processing and AI/ML that are being provided by Syntasa and Google Cloud services.

FIGURE 3.1: Syntasa Solution

Syntasa’s capabilities make it a powerful tool for rapid prototyping, enabling users to quickly iterate and refine prototypes based on real-time data and insights.
Benefits include:
• Rapid prototyping from low-code drag-and-drop interface and full-code interface
• Native support for GCP infrastructure
• Apache Spark and Kubernetes runtime support
• Templatized integrations, processes, and apps to enable consistency and code reuse
• Scalable data and AI app framework for development and production
• Integrated production data + feature + activation pipelines
• Collaboration, version control, and an automated documentation framework
• Advanced job definition, scheduling, and management capabilities with job failure alerts
• Data quality monitoring with visibility into data provenance and lineage
• Business alerting and model performance monitoring

About the Google Cloud Platform
GCP is a suite of cloud computing services offered by Google. It runs on the same infrastructure that Google uses internally for its end-user products, such as Google Search, Gmail, Google Drive, and YouTube. GCP offers a scalable range of services, including compute, networking, storage, big data, security and identity management, management tools, cloud AI, IoT (Internet of Things), and more. Some examples of GCP services are: Compute Engine, App Engine, Kubernetes Engine, Cloud Functions, Cloud Run, Cloud Storage, Cloud SQL, BigQuery, Cloud Pub/Sub, and TensorFlow services.

Global Network
Google Cloud has a worldwide presence. Google’s global network, connected via high-speed cables, moves data across the globe in a highly performant and secure manner. Google Cloud offers FedRAMP Moderate cloud services in Google Cloud data centers around the world, which gives organizations the ability to move data securely and compliantly from one part of the world to another in order to meet key objectives such as data backup requirements.
BigQuery
BigQuery is Google Cloud’s planet-scale, completely serverless, and cost-effective enterprise data warehouse that works across clouds and scales with your data. With BigQuery, Google has separated compute from storage and connected the two via its petabit network, allowing the compute and storage functions to expand vertically and independently of each other. This allows users to leverage as many compute slots as necessary to answer a query; as a result, BigQuery offers measurable performance gains compared to other analytical systems.
• BigQuery Omni: Google gives organizations the ability to leverage BigQuery even if users are housing data with other cloud service providers, or on-premise with BigQuery Omni. When users deploy BigQuery Omni, they are able to query data that is stored outside Google Cloud (for example, in Microsoft Azure or AWS in a tabular format) as if the data were being stored in a Google Cloud BigQuery environment. This capability allows users to receive all the benefits of Google BigQuery without requiring them to move the data across public clouds.
• Data Governance: BigQuery allows for row-level and column-level security as well as other IAM-based permissions at the table and dataset levels. Combined with a DLP solution, BigQuery is one of the most extensible and secure solutions in the cloud today, and these data governance capabilities can also be applied to other clouds via BigQuery Omni.

Translation
Google Cloud offers out-of-the-box (OOTB) translation capabilities that allow translation in 100+ languages. These translations do not require any pretraining, and are available as APIs to be consumed. These translations are some of the highest quality translations in the industry. Today Google offers both text and document translation capabilities. We believe that this will allow the World Bank to meet the needs of its global audience effectively.
DocumentAI
DocumentAI is another differentiator for Google Cloud. It performs OCR and extracts key-value pairs from documents with the highest fidelity, and works particularly well with handwritten documents. The suite includes Document Warehouse, which is a hosted repository of documents. Document AI and Document Warehouse are going to be the earliest targets for introducing large language models (LLMs), which will allow a unified cloud search experience, along with natural language processing (NLP)-based offerings like summarization and chatbots.

Looker
Looker is Google’s cloud-based data exploration, discovery, and data analytics platform. Key information is typically stored in a number of different data stores, each with their own schemas and access processes. Looker provides discovery and real-time analysis of data across multiple data stores, which is critical in understanding disparate information from a business and technical perspective.
• Looker strikes a balance between governance and self service in the deployment of analytics. This scalable, real-time approach prevents data sprawl and duplication headaches, including the common issue of having multiple versions of the same business intelligence (BI) reports and dashboards. Looker is capable of presenting dashboards and reports within the application, embedded in portals, and via third-party BI tools such as Tableau.
• Looker Blocks are free, reusable, and customizable OOTB templates that provide a head start in creating value from data. With Blocks, nontechnical users can quickly turn data into dashboards that can either be used as-is or be easily customized and blended with other data to meet specific needs. Blocks have been prebuilt to model and visualize a wide range of common use cases such as multicloud cost analysis, data warehouse log analysis, and much more. More than 150 Blocks are available for downloading from the Marketplace: https://marketplace.looker.com.
Solution Overview and Key Results
The Syntasa Data and AI platform was utilized for this POC to demonstrate rapid prototyping of several sentiment analytics use cases in Google Cloud Platform (GCP). The platform simplifies the use of GCP cloud services for data scientists and analysts, allowing them to either code or visually build their apps. This helps users focus on constructing their data and AI pipelines using familiar user interfaces like Jupyter Notebook or Syntasa’s low/no code workflow processes.

The POC involved the creation of six Syntasa apps and Looker dashboards. These apps and dashboards explored a wide range of data and AI capabilities, including data ingestion, topic modeling, sentiment analysis, language translation, trend analysis, and AI explainability. The apps and dashboards covered the following use cases:
• Trusted Domains
• Uncertain Domains
• Chinese Language
• PDF Sourcing
• Trend Analysis
• Sentiment Explanation

Key Results
The key results obtained and demonstrated through dashboards, analysis, and discussions are:
• World Bank Group (WBG) domain experts can gain deeper and quicker insights into their subject areas of interest by leveraging automated AI/ML technologies.
• WBG domain experts can focus their efforts by using customized narrative and topic modeling apps, and dashboards tailored to their needs by defining themes, keywords, data sources, languages, categories, and geographies of their choice.
• Sentiment analytics solutions that leverage large language models (LLMs) can classify positive and negative sentiment with greater than 85 percent accuracy when compared to manually classified relevant text.
• Google Translate APIs outperform open-source models by a wide margin, with 96 percent of translations done by Google Translate being deemed acceptable.
• Automated data and AI pipelines can extract full PDF reports from trusted sites and apply AI-based summarization and topic modeling to help WBG experts track the latest developments in their topics of interest.
• Trend analysis can be fine-tuned to the needs of the WBG team to detect and alert users to rising and falling topics, and to highlight emerging high-visibility events such as the FTX collapse and the Silicon Valley Bank failure.

For more details on app configuration and dashboard usage please refer to Technical Approach (Syntasa).

SmartFi - Trusted Domains
The SmartFi - Trusted Domains app was developed to analyze and provide insights from the “SmartFi” content that is available on trusted domains. The app connects to the Brandwatch API; extracts relevant text; loads data into the GCP; filters and transforms the text for topic modeling and sentiment analysis; adds WBG-defined themes; and prepares an analysis-ready dataset for the trusted domain narrative dashboard. More than five years of historical data has been processed, and the production pipeline is updated daily.

The SmartFi - Trusted Domains dashboard, built using Google’s Looker, provides comprehensive visibility into FinTech-related articles about and conversations on trusted domains. Users can analyze key performance indicators (KPIs) and time series charts, and can drill down to original news or social media mentions. The dashboard includes filters for geographic regions, trusted domain categories, domain URLs, and sentiment, allowing for granular analysis of specific regions or categories. The screenshot in Figure 3.2 shows a comparison of activity, sentiment, and trends, based on the categories defined by the project teams.

FIGURE 3.2: Modeled Mentions

For more technical details on data sources and topic modeling, see Technical Approach (Syntasa); and for more details on the narrative dashboard see Appendix A.
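The filtering and theme-tagging steps of the Trusted Domains pipeline can be illustrated with a minimal sketch. This is not the Syntasa implementation, which is documented in Appendix D; it only shows one plausible way to attach WBG-defined themes to mentions by keyword matching. The keywords are taken from Table 2.1, but their assignment to themes, the function names, and the sample mentions are assumptions made for the example.

```python
import re

# Illustrative excerpt only: the keywords appear in Table 2.1, but their
# assignment to themes here is an assumption made for this example.
THEMES = {
    "Asset Tokenization": ["fungible tokens", "digital assets"],
    "Digital Currency": ["CBDC", "stablecoin", "digital wallet"],
    "Web3": ["blockchain", "DeFi"],
}


def compile_theme_patterns(themes):
    """Build one case-insensitive, whole-word regex per theme."""
    patterns = {}
    for theme, keywords in themes.items():
        alternatives = "|".join(re.escape(k) for k in keywords)
        patterns[theme] = re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)
    return patterns


def tag_mention(text, patterns):
    """Return the sorted list of themes whose keywords occur in the text."""
    return sorted(t for t, p in patterns.items() if p.search(text))


def keep_relevant(mentions, patterns):
    """Drop mentions that match no theme and attach themes to the rest."""
    tagged = []
    for mention in mentions:
        themes = tag_mention(mention["text"], patterns)
        if themes:
            tagged.append({**mention, "themes": themes})
    return tagged


PATTERNS = compile_theme_patterns(THEMES)

# Hypothetical mentions, standing in for text extracted from a data provider.
SAMPLE = [
    {"id": 1, "text": "The central bank piloted a CBDC on a permissioned blockchain."},
    {"id": 2, "text": "Quarterly earnings beat expectations."},
]
RESULT = keep_relevant(SAMPLE, PATTERNS)
```

In this sketch the first sample mention is kept and tagged with two themes, while the second is dropped as noise. Whole-word matching is a deliberate simplification: plurals and inflected forms would need explicit keywords or stemming.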
SmartFi - Uncertain Domains
The SmartFi - Uncertain Domains app focuses on all domains that are not included in the trusted domain list. The workflow consists of multiple steps similar to the ones in the Trusted Domains app, with the addition of a process that uses the Twitter API and Twitter IDs to extract Tweet texts for topic modeling. The data extraction is sampled at 2 percent, and over 3 months of historical data has been processed. The production pipeline is updated daily.

The SmartFi - Uncertain Domains dashboard offers comprehensive visibility into FinTech-related articles on and conversations about websites and social media platforms beyond the trusted domains. Users can analyze the impact of major events, such as the FTX and Silicon Valley Bank collapses, and can explore discussions with and without hashtags. The right panel of Figure 3.3 shows the phrases that were present when authors mentioned cryptocurrency exchange.

FIGURE 3.3: Word Cloud

For more technical detail on data sources and topic modeling, see Technical Approach (Syntasa); and for more detail on the narrative dashboard see Appendix A.

SmartFi - Chinese Language
The SmartFi - Chinese Language app was created to demonstrate the translation capabilities of Google Cloud, and compare them with open-source translation routines. The workflow consists of multiple steps, similar to those in the Trusted Domains app, with the addition of multiple translation processes. Only one day of Chinese language mentions (a little less than 1M mentions) was processed, and the production pipeline was not activated.

The SmartFi - Chinese Translation dashboard offers the same analytics abilities as the Uncertain Domains dashboard, but with a focus on Chinese language content. Users can explore and compare narratives expressed by authors in Chinese, with both the original and translated text displayed side by side.
Figure 3.4 shows the domains, authors, and sample translated and original text.

FIGURE 3.4: Domain Source

For more detail on comparison of translation algorithms, see Chinese Translation.

SmartFi - PDF Sourcing
The SmartFi - PDF Sourcing app was created to demonstrate rapid prototyping capability, exploring both website crawling and search API approaches for automating the extraction of PDF reports from trusted sites. The search API approach was found to be more targeted and efficient.

The SmartFi - PDF dashboard provides a faster way to acquire information from trusted data sources, displaying links to PDFs, AI-generated summaries, and topic modeling analysis of the documents. Figure 3.5 below shows that over 5,000 PDF documents were automatically downloaded and analyzed from the European Central Bank site.

FIGURE 3.5: Domain and PDF Sourcing

For more detail on PDF sourcing implementation, see PDF Sourcing.

SmartFi - Trend Analysis
The SmartFi - Trend Analysis app analyzes the output of the SmartFi Trusted Domain app to identify rising and falling topics and phrases. Users can customize trend analysis, for example by using a seven-day rolling average to smooth out daily fluctuations. The app has analyzed the trusted domain app output from October 2022 and is updated daily.

The SmartFi - Trending Dashboard displays the results of the trend analysis, allowing users to detect and be alerted to rising and falling topics, and to highlight emerging high-visibility events, such as the FTX collapse and the Silicon Valley Bank failure. The left panel in Figure 3.6 shows the top five topics/phrases by volume, and the right panel shows the top five rising topics/phrases on Nov 10 2022. As can be seen, a day before the FTX collapse on Nov 11 2022, Alameda Research was the top rising phrase in the trusted sources data.
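The idea of smoothing daily counts with a seven-day rolling average and then ranking rising phrases can be sketched as follows. This is an illustration under stated assumptions, not the app's actual logic: the scoring rule (day-over-day change in the smoothed series), the function names, and the phrase counts are all hypothetical, and the counts are made-up numbers rather than project data.

```python
def rolling_mean(values, window=7):
    """Trailing rolling mean; early days average over the days available."""
    smoothed = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]  # drop the day leaving the window
        smoothed.append(running / min(i + 1, window))
    return smoothed


def top_rising(daily_counts, day_index, window=7, k=5):
    """Rank phrases by the day-over-day change in their smoothed volume.

    daily_counts maps each phrase to a list of daily mention counts,
    all of the same length. Returns up to k (phrase, change) pairs.
    """
    if day_index < 1:
        raise ValueError("day_index must be at least 1 to compare with the prior day")
    scores = {}
    for phrase, counts in daily_counts.items():
        smoothed = rolling_mean(counts, window)
        scores[phrase] = smoothed[day_index] - smoothed[day_index - 1]
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Made-up counts for two placeholder phrases over ten days.
COUNTS = {
    "phrase_a": [10] * 10,            # steady volume, no trend
    "phrase_b": [0] * 8 + [14, 70],   # sudden surge on the last two days
}
RISING = top_rising(COUNTS, day_index=9)
```

Here `phrase_b` ranks first because its smoothed volume jumps on the final day, while the steady `phrase_a` scores zero. A production rule would likely also normalize by baseline volume so that small phrases with large relative growth are not overlooked.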
FIGURE 3.6: Trending Topics

SmartFi - Sentiment Models and Explainability

The SmartFi - Sentiment Explanation app was created to address two research questions: 1) comparison of different sentiment analysis models; and 2) analysis of gender and race bias. The FinBERT model was found to be over 85 percent accurate for positive and negative sentiment classification when compared to manually classified relevant text.

The SmartFi - Sentiment validation dashboard provides an in-depth view of the sentiment analysis, allowing users to explore the performance of different sentiment models, such as the FinBERT model and Google’s AutoML. The middle panel in Figure 3.7 below shows that the FinBERT model was over 85 percent accurate for financial text.

FIGURE 3.7: Sentiment Validation

For more detail on the sentiment model comparison, see Sentiment Analysis.

The dashboard also enables users to analyze potential gender and race biases in sentiment classification, providing insights that help to ensure unbiased analysis of financial narratives.

FIGURE 3.8: Sentiment Model Explainability

For more detail on the model explainability, see Trustworthy and Explainable AI.

Technical Approach (Syntasa)

Data Sources and Preparation

Reference Data

Working in collaboration with Syntasa, the World Bank Group (WBG) provided a number of parameters to help scope this project, facilitate data collection, and ensure alignment with WBG business objectives. These data were defined by the WBG in a spreadsheet that included themes and keywords related to financial technology in both English and Chinese; a prioritized list of online news and media websites referred to as Trusted Domains; and geographic regions of interest. (For more detail see Appendix B: Reference Data.)
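To make the role of the reference data concrete, the sketch below shows one way such a spreadsheet could be applied to a single mention. It is an assumption-laden illustration: the report does not describe the matching logic, so plain substring matching is used here, and the theme names, keywords, domain, and function name are hypothetical placeholders rather than entries from the WBG reference data.

```python
from urllib.parse import urlparse

# Hypothetical reference data; the real themes, keywords, and trusted
# domains were defined by the WBG (see Appendix B).
THEME_KEYWORDS = {
    "digital currency": ["cbdc", "stablecoin"],
    "payments": ["mobile payment", "real-time payment"],
}
TRUSTED_DOMAINS = {"example-news.org"}


def tag_mention(text, url):
    """Return the themes whose keywords appear in the text, and whether
    the mention's domain is on the trusted list."""
    lowered = text.lower()
    themes = sorted(
        theme
        for theme, keywords in THEME_KEYWORDS.items()
        if any(keyword in lowered for keyword in keywords)
    )
    domain = urlparse(url).netloc.lower()
    if domain.startswith("www."):
        domain = domain[4:]
    return {"themes": themes, "trusted": domain in TRUSTED_DOMAINS}


print(tag_mention("Central bank pilots a CBDC for mobile payment",
                  "https://www.example-news.org/article"))
```

A mention can match more than one theme, and the trusted flag is what would route it to either the Trusted Domains or the Uncertain Domains pipeline.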
SmartFi - Trusted Domains

The goal of the SmartFi - Trusted Domains solution is to extract meaningful insights from the WBG Trusted Domains. The SmartFi - Trusted Domains app contains the pipeline that was created in Syntasa to ingest and process the underlying data needed to accomplish this. The app includes a combination of ready-made and custom processes that ingest the Brandwatch Trusted Domains dataset and the WBG themes, categories, and regions into BigQuery; process each mention to mitigate noise; apply the predefined WBG themes, categories, and regions; and extract topics, phrases, and companion phrases. Finally, the data is combined into a single curated dataset used for analysis in Looker. Figure 3.9 shows the data and AI pipeline for the SmartFi – Trusted Domain app configured in the Syntasa Platform. (For more details see Appendix D: SmartFi – Trusted Domains Technical Details.)

FIGURE 3.9: Solution Architecture

SmartFi - Uncertain Domains

The SmartFi - Uncertain Domains app contains the pipeline that was created in Syntasa to ingest and process the underlying data needed to extract meaningful insights from sources other than the WBG Trusted Domains, which are explicitly excluded. A lighter version of the SmartFi - Trusted Domains app, this app includes a combination of ready-made and custom processes that ingest the Brandwatch Uncertain Domains dataset and the WBG themes into BigQuery; apply the predefined WBG themes; and then extract topics, phrases, and companion phrases. Licensing restrictions prevent Brandwatch from providing any tweet text via the Brandwatch API, so the app also retrieves the full tweet text directly from the Twitter API. Finally, the data is combined into a single curated dataset used for analysis in Looker. Figure 3.10 shows the data and AI pipeline for the SmartFi – Uncertain Domain app configured in the Syntasa Platform.
(For more detail see Appendix E: SmartFi – Uncertain Domains Technical Details.)

FIGURE 3.10: Data and AI pipeline

SmartFi - Chinese Language

The SmartFi - Chinese Language app contains the pipeline that was created in Syntasa to ingest and process the underlying data needed to extract meaningful insights from the Chinese mentions. The app functions identically to the SmartFi - Uncertain Domains app, with the addition of a ready-made translation process to translate Chinese snippet text into English. Figure 3.11 shows the data and AI pipeline for the SmartFi – Chinese Language app configured in the Syntasa Platform. (For more detail see Appendix F: SmartFi – Chinese Language Technical Details.)

FIGURE 3.11: Chinese Language App Configuration

Topic Modeling

Syntasa has conducted topic modeling on social media and news texts to bring to light the most dominant and frequent conversations contained within them. The strategy is to start with a general subject area (for example, text that contains keywords related to finance) and break it down further into expert-defined themes (top-down) and AI-identified topics (bottom-up) for quick discovery of the narratives being discussed. Syntasa’s focus is on automating the clustering workflow so as to reduce manual oversight, work dynamically on either small or big data, automatically discard irrelevant text, and preserve the most dominant clusters, which are also self-named.

An unsupervised clustering approach is most useful because the topics (or classes) are not known beforehand. Likewise, developing a classifier from the clusters would not be a suitable solution, because it would not be able to discover new conversations as they appear in real time.
Some of the popular approaches to clustering involve algorithms such as KMeans or LDA, which can be used to group similar sentences/text together but have some downsides, especially with very diverse text. These algorithms require knowing beforehand how many clusters it is optimal to create; otherwise the clusters start blending words that have no similarity to each other. Determining the optimal number of clusters (K) requires sampling many different Ks, with manual oversight to search for that number. There is also no guarantee that the sampling will include the optimal K of the text; rather, the analysis would select only the best K of the sampling. Therefore, searching for the optimal K with manual oversight increases computational and labor costs. In the case of social media and news text, conversations can be so diverse that it becomes impractical to find the optimal K needed to force all of the text into respective clusters.

Examining the contents of these clusters is usually done by pulling n-grams (for example, bigrams or trigrams) and having an analyst manually determine the “topic” that is being discussed. Because sentences are long compared to n-grams, there will be a mixture of unrelated n-grams in a group that is supposed to summarize the cluster content.

To overcome the problem of manually naming clusters based on n-grams that likely do not have similarity to each other, a novel approach had to be developed. Rather than clustering at the sentence level and then examining n-grams, Syntasa developed a strategy that begins with a focus on the n-grams themselves. First, the highest-occurring n-grams are used as “topics” and serve as the cluster centers; they are then checked against other nontopic n-grams for similarity. This allows the topic to be self-named and only similar phrases to be grouped together, which makes the content of the cluster obvious.
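The n-gram-first strategy can be illustrated with a small sketch. It is not Syntasa's implementation: the report's approach relies on a BERT embedder for the similarity check, whereas this example substitutes a simple word-count cosine so that it runs with the standard library alone. The function names, the use of bigrams, and the `top_k` and `threshold` values are assumptions chosen for illustration, and the sample texts are made up.

```python
import math
from collections import Counter


def bigrams(text):
    """Split text into lowercase word bigrams."""
    words = text.lower().split()
    return [" ".join(pair) for pair in zip(words, words[1:])]


def cosine(a, b):
    """Cosine similarity between two phrases based on word counts.

    Stand-in for an embedding-based similarity (assumption).
    """
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(
        sum(v * v for v in va.values()) * sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0


def build_topics(texts, top_k=2, threshold=0.4):
    """Use the most frequent bigrams as self-named topics, attach similar
    bigrams to them, and report which texts were left without a topic."""
    counts = Counter(g for t in texts for g in bigrams(t))
    topics = [gram for gram, _ in counts.most_common(top_k)]
    members = {topic: {topic} for topic in topics}
    for gram in counts:
        for topic in topics:
            if gram != topic and cosine(gram, topic) >= threshold:
                members[topic].add(gram)
    linked = set().union(*members.values())
    discarded = [t for t in texts if not linked.intersection(bigrams(t))]
    return members, discarded


sample_texts = [
    "crypto exchange collapse",
    "crypto exchange fees rise",
    "central bank raises rates",
    "central bank digital currency",
    "crypto exchange hacked",
    "hello",
    "nice weather today",
]
topic_members, dropped = build_topics(sample_texts)
print(topic_members)
print(dropped)
```

Because the cluster center is itself a frequent phrase, each group arrives already named, and text that is too short or unrelated to any top phrase simply falls out instead of being forced into a cluster.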
Similarity checks between the n-grams are made by using a BERT embedder with cosine similarity. All of the text that does not get linked to a topic is then discarded. The discarded text includes very short text that cannot form a valid n-gram; text that contains a valid n-gram that does not occur often enough; and text that contains a valid n-gram that is unrelated to the top topics. Discarding text is a desired side effect, because the discarded text does not contaminate the other text.

FIGURE 3.12: Topic Modeling Parameters

While Syntasa’s topic modeling rapidly produces self-named topics, the term “topic” is itself subjective. To some analysts, a topic could be as high-level as “crypto” or “banking,” but to others it might be more granular; for example, “cryptocurrency exchange,” “blockchain technology,” or “smart contracts.” The more granular the topics, the more topics there will be; this can overwhelm the analysis, but it can also be more informative about the narrative. To preserve both the high-level and granular topics, Syntasa prepares the data carefully for dashboards in order to give the user full filtering capabilities for narrowing the conversation further. A user can start with the larger topics that are generated and then delve into the various narratives surrounding them.

FIGURE 3.13: Dashboard Trending Phrases

Sentiment Analysis

Syntasa conducted an experiment that explored the use of additional NLP models for producing sentiment classifications and determined the level of agreement between the models and members of the World Bank Group. This was an effort to improve on the built-in sentiment model from Brandwatch, which proved to be a black box that made unreliable classifications. Two models from Hugging Face’s model repository were selected.
The first was a model trained on 124 million tweets that had learned colloquial conversation; the second, named FinBERT, was trained to understand financial terminology. Both models proved to be good in their respective fields. The Twitter model could accurately identify positive or negative text, for example, in the context of reviewing products (in this case, crypto exchanges), whereas the FinBERT model did a better job of accurately classifying financial terms (for example “surged 27 percent,” or “$40bn implosion”). If a mix of colloquial talk and formal financial talk is to be collected in the future, an ensemble or combination of these models could be used to capture more of the text accurately.

FIGURE 3.14: Sentiment Explainability

In order to evaluate the accuracy of each model (Twitter, FinBERT, and Brandwatch) we asked a WBG domain expert to manually score over 200 texts into negative, neutral, and positive sentiment classes. This enabled us to determine the accuracy of each model using the WBG scores as ground truth. Our model evaluation showed that:

• The Brandwatch sentiment model, at only 24 percent accuracy, was unacceptable, as we had seen in working with other clients.
• The Twitter roBERTa sentiment model was also unacceptable. It was only 47 percent accurate; after some tuning (keeping only classifications with a confidence score of 70 percent and above) we were able to increase the accuracy, but only to 51 percent.
• The FinBERT model, on the other hand, started with 64 percent accuracy, and after applying the same 70 percent confidence threshold its accuracy increased to 75 percent.

The FinBERT model was the only acceptable model we found. We also experimented with ensemble models, in which the answer of either the Twitter or the FinBERT model could supersede that of the other model based on a higher confidence score.
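The confidence-threshold tuning and the confidence-based ensemble can be sketched as follows. This is an illustrative sketch under stated assumptions, not the evaluation code used in the project: the record layout, the function names, and the four labeled examples are hypothetical, and the figures it prints bear no relation to the accuracies reported in this section.

```python
def accuracy_above(predictions, truth, min_confidence=0.7):
    """Accuracy over only those predictions at or above a confidence cutoff.

    predictions: list of (label, confidence) pairs from one model.
    truth: list of expert labels in the same order.
    Returns (accuracy, number_of_qualifying_records).
    """
    kept = [
        (label, expected)
        for (label, conf), expected in zip(predictions, truth)
        if conf >= min_confidence
    ]
    if not kept:
        return 0.0, 0
    correct = sum(1 for label, expected in kept if label == expected)
    return correct / len(kept), len(kept)


def ensemble(model_a, model_b):
    """For each text, keep whichever model's answer has higher confidence."""
    return [a if a[1] >= b[1] else b for a, b in zip(model_a, model_b)]


# Hypothetical outputs for four texts, as (label, confidence) pairs.
twitter = [("positive", 0.9), ("neutral", 0.5),
           ("negative", 0.6), ("positive", 0.8)]
finbert = [("positive", 0.6), ("negative", 0.8),
           ("negative", 0.9), ("neutral", 0.4)]
expert = ["positive", "negative", "negative", "negative"]

print(accuracy_above(finbert, expert))
print(accuracy_above(ensemble(twitter, finbert), expert))
```

Returning the number of qualifying records alongside the accuracy makes the trade-off visible: a stricter threshold can raise accuracy while shrinking the share of text that receives a classification at all.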
The ensemble approach did not increase accuracy by a large factor, but it did slightly increase the number of records qualifying above the 70 percent confidence score. The Brandwatch model proved to be the least accurate, and it also offered no way to conduct bias tests or see confidence scores.

The WBG expert also classified the text in the experiment to indicate whether it was relevant or not. More than 75 percent of the text was deemed relevant. When focusing on text the WBG marked as relevant, only positive and negative sentiments, and a confidence score of greater than 60 percent, FinBERT produces an accuracy of 86 percent. While a “relevant” classification will not be available on new live data, a text classifier model could be developed to further narrow down the text. The screenshot of the dashboard in Figure 3.15 shows that 86 percent agreement was captured.

FIGURE 3.15: Sentiment Validation

This dashboard was created in order to observe the results of the sentiment model validation test. The three pie charts show, in order: the distribution of the sentiments that WBG supplied; the sentiment distribution of the model selected; and the percentage of agreement between the WBG expert and the model. Filters allow for the selection of:

• The 3 models + the 2 ensemble methods
• The sentiment outputs of positive, negative, neutral
• Confidence scores
• Agreement between the World Bank Group and the model
• The World Bank Group’s relevance indicator

These filters are useful for narrowing down the acceptability of the models, such as on specific classifications and/or at specific confidence levels.

In conclusion, we found that FinBERT, an open-source sentiment model, can be an effective way of producing accurate sentiment classifications that are closely in line with WBG’s expert opinions.
We also demonstrated that accuracy can be boosted by adjusting the confidence thresholds, and by limiting the scope to just positive and negative sentiment classes.

Chinese Translation

The Helsinki-NLP open-source model https://huggingface.co/Helsinki-NLP/opus-mt-zh-en was used for translation from Chinese into English, with similar models available for all of the most common languages.

Syntasa conducted a comparison of two translation options: Hugging Face/generic models and Google’s Cloud Translation service. These two options differ in several key aspects, including cost, speed, customization, and language support.

Hugging Face models are a lower cost option for translation needs. They are easy to customize by simply swapping the model type, and they are relatively inexpensive to run. However, they are slower than Cloud Translation, and they require language detection capabilities in order to handle multiple languages. Hugging Face models are also limited in their language support, as they are designed for specific languages and require additional work to add language detection capabilities.

Google’s Cloud Translation service is a fast and flexible option. It can easily be configured to stream directly, and it is capable of translating any language it can detect. This language-agnostic nature makes it a more flexible option for multilingual projects. However, Cloud Translation is also more expensive than the Hugging Face models ($20 per 1 million characters). This can make it a less suitable option for projects that require higher-volume translations and are operating within tight budgets.

As part of the comparison, the WBG team manually evaluated the translations produced by each option. The Google translations were found to be superior in quality, mostly because the Hugging Face model did not translate the entire text.
This analysis indicates that both options can be used, but the Google translation is preferable when a higher-quality translation is required. Figure 3.16 shows the comparison of the cloud and model translations, where it can clearly be seen that the cloud translation results are superior.

FIGURE 3.16: Language Translation Performance (Translation Comparison: Cloud win 71%; Even 25%; Hugging Face win 4%)

Ultimately, the choice between Hugging Face models and Google’s Cloud Translation service will depend on the specific needs of the project. If cost is a primary concern, Hugging Face models offer a cost-effective solution. If flexibility and speed are important factors, Cloud Translation is the better option, despite its higher cost. The language requirements of the project should also be considered, since Hugging Face models may require additional work to support multiple languages.

PDF Sourcing

To gather PDFs related to specific topics, Syntasa explored two primary methods: search APIs and website crawling. A search API, the most effective solution, was previously offered by Google but is no longer available. As an alternative, Syntasa used Bing Web Search to gather PDFs related to specific topics. This allowed for a more streamlined and efficient process for sourcing relevant PDFs. While website crawling is an ideal method for gathering every PDF, it proved to be slow, cumbersome, and expensive. Therefore, it is not recommended as a primary method for PDF sourcing. (See Figure 3.17.)

FIGURE 3.17: PDF Sourcing

Syntasa used the Bart-Large-CNN summarization algorithm to summarize the content of the sourced PDFs. This app can be easily modified to incorporate any other summarization algorithms used by the WBG, or any publicly available summarization models. Other models were tested, including Google’s Pegasus model, but Syntasa did not perform in-depth evaluation and comparison of the two models.
The Bart-Large-CNN algorithm performed sufficiently well for this use case since the focus was on PDF extraction.

In conclusion, Syntasa used Bing Web Search as an alternative to the previously offered Google Search API to gather PDFs related to specific topics. The Bart-Large-CNN summarization algorithm was used to summarize the content of the PDFs, showing that it is possible to extract PDF documents from specified domains and summarize them in order to increase the efficiency of the current manual process. For future evaluation and exploration, we recommend a more systematic review of various summarization algorithms.

Trustworthy and Explainable AI

For the two sentiment models put in place, bias tests were conducted by Syntasa using a Python library called Transformers-Interpret. This library can explain a PyTorch model obtained from Hugging Face by displaying the weights of the features. It uses Facebook’s Captum to apply integrated gradients on the features in order to obtain the weights of each word in the text. By using this library, Syntasa was able to determine whether the models reacted significantly differently to gender or racial terminology.

Using a word replacement strategy, the same sentence was used while switching out gender or race-related words (for example, “he” or “she”). The sentiment and confidence scores were then examined to determine whether these kinds of words could sway the model. In all of the experiments, the sentiment never changed, regardless of race or gender terms, and the confidence scores showed only slight variability.

FIGURE 3.18: Sentiment Explainability

The dashboard screenshot in Figure 3.18 shows the outputs of the model trained on Twitter data and how it reacted to identical sentences where gender-based terminology was swapped out.
In all three examples, we can see that the sentiments were the same whether a male or a female term was used; there were also similar confidence levels. Having conducted this experiment, Syntasa now has the structure in place for future bias tests that will be able to accommodate new types of testing, as desired by the WBG.

Solution Integration

There are many options available for integrating the Syntasa Data and AI platform and the sentiment analytics solution with the WBG IT environment. Figure 3.19 shows the technical deployment architecture for the Syntasa Platform in GCP, including the GCP services and network configuration.

FIGURE 3.19: Solution Architecture

The current POC was conducted as a private cloud SaaS where a single-tenant solution was hosted in a dedicated GCP project within a fully controlled Virtual Private Cloud (VPC). A similar solution architecture has been deployed for clients with highly sensitive data, and this architecture, when deployed in the WBG’s GCP organization, can achieve the highest level of compliance, including FedRAMP High. The solution can support Single Sign On to simplify the connectivity from the WBG network, using the existing corporate authentication services.

For the POC, since only publicly available data was used, it was determined that the simplest compliant option was to host the solution in a Syntasa-controlled GCP project, and use the WBG GCP billing account.

Given the initial success of the POC in demonstrating the potential of using large language models to automate text and sentiment analysis for several use cases, and given the additional exploration, development, and testing required to reach a production-ready state, we can envision proceeding with either a similar arrangement of a Syntasa-controlled GCP project, or a WBG-controlled GCP organization, folder, and project.
SECTION 4: LEARNING OUTCOMES AND FUTURE CONSIDERATIONS

Technical Learnings for World Bank

Topic Modeling

Topic modeling is a statistical and computational technique used to identify underlying topics or themes within a collection of texts or documents. It is a process of extracting meaningful patterns or themes from large volumes of text data. The goal of topic modeling is to identify the most significant topics present in the documents without prior knowledge of the topics.

The most commonly used types of topic modeling are Latent Dirichlet Allocation (LDA) and Non-Negative Matrix Factorization (NMF). LDA is a probabilistic model that assumes that each document contains a mixture of topics, and each topic is a probability distribution over words. The model infers the topics based on the distribution of words in the documents. The output of LDA is a set of topics, along with the distribution of each topic across the documents, and the distribution of each word across the topics.

NMF is a matrix factorization technique that decomposes the document-term matrix into two matrices, one representing the weight of each topic in each document, and the other representing the weight of each word in each topic. The output of NMF is a set of topics, along with the weight of each word in the topics.

Previously, TI Lab worked on several projects that required topic modeling. LDA was used primarily to tackle the grouping of documents into clusters. During our collaboration, Syntasa introduced us to its custom algorithm, which proved to be more accurate and robust than LDA.

Syntasa’s team introduced us to a mix of three different algorithms used to achieve the project’s objective. This objective is based on the need for the text snippets to be assigned to multiple topics and for the topics to be named automatically.
Since FinTech-related social media posts are extremely diverse, using the LDA algorithm is no longer a viable option.

FIGURE 4.1: Topic Modeling (Syntasa clustering: K-means clustering, with highest-occurring phrases as cluster centers; Fast clustering, with similarity checks using cosine similarity with volume criteria; Graph Networks, to link snippets to multiple topics)

The K-means topic modeling technique is a clustering method that groups the documents into a fixed number of topics based on the similarity of their word frequencies. The key disadvantage of this modeling technique is that it requires manual naming of topics, and it assigns only one topic per snippet of text. It is also sensitive to the initial conditions, and the results may vary depending on the random initialization of the algorithm. However, if used in combination with other algorithms, it can provide valuable insights.

Fast Clustering is another type of topic modeling that works somewhat like hierarchical clustering, but is tuned for speed. It is useful when the number of clusters is unknown and the dataset is quite large. With fast clustering, the developer can freely configure the threshold of what is considered to be similar. A high threshold will only find extremely similar sentences; a lower threshold will find more sentences that are less similar to each other.1

1 https://www.sbert.net/examples/applications/clustering/README.html

Graph Networks can also be used to link multiple topics to a text.
Graph Networks represent the documents and topics as nodes in a graph, and the relationships between them as edges. By analyzing the graph, it is possible to identify the most significant topics and their relationships to the documents. Graph Networks can also be used to visualize the topics and their relationships, making it easier to interpret the results. The solution is fully scalable, using Apache Spark on large data to take advantage of the cloud infrastructure.

The outcome for a real-life example is described using the image shown in Figure 4.2. The phrase that was analyzed is “The blockchain network allows users to avoid Central Banks.” This sentence clearly has more than one topic, and the figure shows how it can be connected to three different topics: Allows Users; Blockchain Technology; and Central Banks.

FIGURE 4.2: Topic Modeling Explainer

Sentiment Analysis

Sentiment analysis is the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Open source software tools, as well as a range of free and paid sentiment analysis tools such as RoBERTa, Google Cloud translation, and BERT, automate sentiment analysis on large collections of texts, including web pages, online news, and blogs.

Sentiment analysis is widely used at ITSTI to analyze internal documents, risk management, feedback review, online and social media data, and so on. Pretrained models with different datasets have different capabilities and strengths. Sentiment models should be selected based on the specific business demands and the available data. After exploring three Sentiment Analysis models for financial data in this prototype, we determined that the FinBERT model focuses on financial data and produces better results.
TABLE 4.1: Sentiment Analysis Models

Brandwatch | FinBERT | Twitter roBERTa
Multilingual Sentiment Model | FinBERT is a pre-trained NLP model to analyze sentiment of financial text | roBERTa-base model trained on ~58M tweets and finetuned for sentiment analysis with the TweetEval benchmark
Hybrid approach to Sentiment Analysis: Knowledge-Based -> ML -> Custom Rules | Narrow focus on financial data | Effective at picking up colloquial talk

Chinese Translation

AI translation is a machine translation process based on complex, deep learning algorithms. Using intelligent behavior, it can understand a source text and generate another text in a different language. Translation is required in order to build a more robust tool covering other languages. Since Chinese is widely used in FinTech-related data in Asia, during the Syntasa engagement we applied both Simplified Chinese and Traditional Chinese themes and keywords to collect media data in Chinese. Then we tested different translation services on snippets, and compared the quality of Google Translation with Hugging Face. The results show that Google Translation performs better than Hugging Face in terms of the completeness and accuracy of the content; that is, Google is more comprehensive than Hugging Face and it also works for long and complex texts. It is also more accurate in some key verb translations. Hugging Face usually misses some content, especially in the context of a long sentence, and it cannot recognize many professional terms and proper nouns, such as brand names (for example, Moutai). But when the sentence is short, Hugging Face is concise and accurate; it is not worse, and sometimes it is even better than Google.

Business Intelligence Tool: Looker

Looker by Google is a business intelligence (BI) and data analytics platform, comparable to Microsoft Power BI.
This web-based tool offers plenty of analytics capabilities that businesses can use to explore, discover, visualize, and share analysis and insights. Looker earns good marks for reporting granularity and scheduling, drag and drop interface, and prebuilt templates and data models. Looker has more colorful UI graph options and a customizable layout size. It is easy to apply Looker to visualizing results and building enterprise-level products such as dashboards and websites. However, since Looker was integrated into Google’s system just a few years ago, it has limited AI and statistical functions. Its price is also higher than that of Power BI.

Business Learnings and Outcome

This section describes how the dashboard can be useful for finance and technology users.

Key Learnings (Technology)

1. Significance of input data: Input data is the foundation of any solution that aspires to use emerging technologies like artificial intelligence. Therefore, it is important to ensure that the data used to train AI models is accurate, representative, and sufficient in quantity.
2. Explainability and transparency: As AI models become more complex, it is important to ensure that they are explainable and transparent. The decision-making process of the model should be easily understood and verified by humans concerning which data is relevant; what data can be categorized into which theme/keyword; what data to exclude, and so on. Explainability and transparency can also help to build trust in the solution.
3. Continuous technology learning and improvement: One of the significant advantages of AI is that it can improve over time; but this requires continuous feedback and training. It is important to continuously monitor and evaluate the performance of AI models, and update them to ensure the relevance of results over time.

Key Learnings (Project)

1. Clear base requirements: It is important to have clear and well-defined requirements for such a PoV.
This will help to ensure that everyone involved in the project is on the same page and has a common understanding of what needs to be achieved. Technical scoping sessions are relevant steps in the process of streamlining project requirements, and ensuring their alignment with the relevant business needs. ITSTI, along with TREFT and Syntasa, will set up dedicated scoping sessions at project initiation to clarify the basic project requirements.
2. Stakeholder engagement through collaboration and expertise: It is important to involve relevant stakeholders throughout project engagement, from the ideation phase through to scoping and development. TREFT has performed the role of the business user collaborating with ITSTI to finalize the business and technical requirements, and has collaborated with Syntasa as the developer of the solution.
3. Agile approach: An agile approach toward this project enabled the solution to be developed as close to the relevant business needs as possible. Given the possibility of showcasing a key functionality during the engagement and its alignment with the business needs, the project teams tested the PDF Summarizer function in lieu of API integration.
4. Testing and quality assurance: Continuous manual testing and analysis of parts of output data at different stages of the engagement has helped to maintain business relevance and ensure quality assurance. During this engagement, manual testing was especially important in areas related to topical relevance, quality of translation, and user interface. This helped to prevent issues and ensure that the solution is reliable and effective.

Key Business Outcomes:

1. Efficiency.
The ability to intelligently source relevant FinTech news by mimicking human logic, and to present it on a dynamic dashboard powered by Google’s Looker platform, streamlines the tedious news-sourcing process and reveals detailed insights on digital trends and on sentiment toward the topics. Such a solution could help to save time and resources that would otherwise be spent sourcing important news manually. It could also reduce human error in identifying news sources that are potentially biased or irrelevant, as well as gather relevant news sources that a human might miss due to the massive volume of news data on the internet.

2 Scalability. The consolidation and representation of large volumes of data on a dynamic dashboard such as Looker allows users to customize search criteria based on their needs, and to categorize data by its relation to topical areas of interest. The ability to review market sentiment across multiple topics yields insights that can be used as inputs in creating briefing notes, resources, knowledge material, slide decks, and reports for senior management review and the wider TRE audience.

3 Relevance. Ultimately, this solution can also allow treasury staff to stay fully involved in and informed about the most relevant happenings within the topics of interest, enabling the organization to potentially capitalize on key opportunities for innovation within this space, and leverage these technologies to improve TRE operations.

The applicability of the solution to other use cases is another opportunity. Currently this solution captures news and material on a specific list of topics, from specific sources defined by the project team. The list of topics and sources can be changed, which indicates the potential universality of the base solution (with customized features) across various use cases.
Considerations for Production Solution

• Chatbot integration/plug-in (BARD AI or ChatGPT): A solution that could enable the user to source the relevant information by conversing with a chatbot.
• Language translation: A solution that could capture resources and materials in a multilingual setting, thus increasing the geographical reach and revealing more significant results.
• PDF summarizer: A solution where large text files/PDFs are converted into an easily understandable and brief summary, with suggestions for how it could increase convenience for users.
• Expand scope to test intelligence: A solution where the input data is more broadly categorized, and the output data is expected to be even more specific and filtered.

Appendices

APPENDIX A
Narrative Dashboard Features

A templated narrative dashboard was deployed to shorten development time, limiting the scope to first insights. Although each dashboard was then customized to best meet the requirements set forth in this POC, they share many of the same features:

Filters

To facilitate noise mitigation and focused exploration, two types of filters are included on the dashboard: cross-chart filters and top-level filters. Cross-chart filtering enables users to interact with most of the elements on the dashboard. For example, if a specific topic is selected on the topics table, the dashboard will filter all of the charts based on the selected topic. Top-level filters appear at the top of the dashboard and provide extensive filtering capabilities: selection of a specific date range and time series chart granularity, and inclusion or exclusion of any combination of themes, topics, phrases, companion phrases, types of mention (unique vs. repeat), page type, domain, author, and/or language.
KPI Scorecards

KPI measures at the top of the dashboard include the number of sampled mentions, modeled mentions, percentage of mentions modeled, calculated net sentiment, and oldest and newest mention dates with respect to the applied filters, providing a high-level overview.

Volume and Sentiment Time Series

These two visualizations, found underneath the scorecards, show FinTech sampled and modeled mentions by volume, and net sentiment over time. In addition to showing how volume and sentiment are changing over time, peaks and valleys are often indicative of significant events of interest that may warrant further investigation.

Countries

The dashboards include a table that shows mention volume, percentage, and sentiment by country, along with a heat map visualization. Through these features, users are able to understand and compare the level of engagement and sentiment in various countries.

Themes

To facilitate top-down analysis, a series of tiles provide mention volume and sentiment by theme; mention volume by theme over time; and mention sentiment by theme over time. As detailed in Reference Data, the themes were provided by WBG SMEs. They include Asset Tokenization, Digital Currency, and Web3, and are consistent across all three narrative dashboards. This is useful for understanding and comparing proportionality and sentiment across various known areas of interest. The time series charts visualize changes in the discussion to help users understand the ebb and flow of engagement and sentiment for these themes.

Topics

Complementary to the top-down approach of themes, topics can be thought of as being constructed from the bottom up. Using AI and natural language processing (NLP), the mentions are analyzed to identify recurring phrases and are dynamically grouped into topics.
For example, the phrases “bitcoin,” “btc,” and “ethereum” might be categorized under the topic “cryptocurrency.” The Topic Modeling section provides more information on the topic modeling implementation. As with themes, the same series of tiles is provided for topics to show how prevalent the initial topics of interest are in digital narratives, as well as additional topics that are emerging from the conversation. Often many of the collected mentions do not fit inside one of the predefined themes. These tiles typically surface previously unrecognized topics of discussion that are taking place outside of the predefined themes, and are likely of interest.

Phrases & Companion Phrases

Phrases are identified by the algorithm using parts of speech to select the most relevant phrases and words. The algorithm also identifies the companion phrases that are used most commonly with each phrase. These tables show the most common phrases and companion phrases in FinTech-related posts and articles by volume and sentiment. Accompanying word clouds allow for visual analysis. Phrase volume and sentiment can be compared in order to understand the multitude of narratives taking place. One particular phrase can also be selected for deep analysis. By reviewing the associated companion phrases, users are able to determine the specific subject matter being discussed in relation to the broader topic of conversation. For example, when selecting the phrase “bitcoin,” the top two companion phrases that appear might be “ethereum” and “cardano.” This suggests that mentions that include the phrase “bitcoin” are often discussing “ethereum” and “cardano” in relation to bitcoin.

Page Type

Page Type refers to the category of website the mention was found on; that is, news, forums, or blogs, as well as large social media platforms like Twitter, Facebook, and YouTube.
The dashboards include the same series of tiles for Page Type as for themes and topics, and provide insights into where the discussion is taking place, comparative sentiments, and changes over time. Reach Estimate is an additional measure included here to explain which Page Type participants are most likely to engage with. (See more in Reference Data.) For example, a minority of the mentions may come from Twitter compared to mentions in news, suggesting that the bulk of the discussion is happening in the news. However, Twitter’s significantly higher reach estimate indicates that despite fewer mentions on the platform, significantly more people are likely to be exposed to those mentions.

Domains

A domain tile is included to analyze volume and sentiment. Domain is the domain name of the website from which the mention originated (for example, Twitter.com). This table allows the user to understand and compare engagement and sentiment across domains, or filter mentions to focus analysis on one or more domains.

Authors

Author is the nickname, user name, or full name of the entity that posted a mention. The authors table displays the author of a given post or comment, the domain the content was posted to, the number and net sentiment of mentions authored, and the author’s reach estimate. Users are able to identify key participants, their sentiments, and their relative influence on the discussion.

Mention Details

The original text of the mention is displayed in the Mention Details table. This table reveals the author of the comment, the text of the mention, the originating domain, and the date it was posted, thus providing users with an expanded context. A URL link button to the original source of the mention is included to facilitate analysis in its original context.
An impact score for each mention is also included to help users understand the relative impact a mention is likely to have had in the discussion, as discussed in Reference Data.

APPENDIX B
Reference Data

Themes and Keywords

The relevant smart finance keywords in the list were grouped and categorized by the World Bank Group, generating a total of three themes of interest, based on WBG business use cases: asset tokenization, digital currency, and Web3. Keywords ranged in specificity from a particular cryptocurrency such as Bitcoin, to more generalized terms, such as digital wallet.

Asset Tokenization

The Asset Tokenization theme contained approximately 21 keywords:

• Bitcoin
• Circle (USDC)
• Cold Wallet/Hot Wallet
• Cryptowinter
• Ethereum
• Fungible tokens
• ICO (Initial Coin Offering)
• NFT (Non-Fungible Tokens)
• Programmable Money
• Programmable Payments
• Sats/Satoshis
• Tether (USDT)
• Stellar Development Foundation
• Security Tokens Offering (STO)
• Digital Assets Platform (DAP)
• Onyx
• Orion
• Digital Promissory Note
• Digital Financial Market Infrastructure (DFMI)
• carbon tokenization
• carbon credits/certificates

Digital Currency

The Digital Currency theme contained approximately 24 keywords:

• Adoption
• Apple Pay
• CBDC (Central Bank Digital Currency)
• DCEP (Digital Currency Electronic Payment) / e-CNY / Digital Yuan
• Delivery versus Payment (DvP)
• Digital Assets
• Digital Dollar
• Digital Euro
• Digital Wallet
• Double Spending
• Fiat currency
• Financial inclusion
• FOMO (Fear of Missing Out)
• Google Pay
• Instant Payment
• MetaPay
• Public-Private Partnership (PPP)
• Retail CBDC
• Stablecoin
• Wholesale CBDC
• Ripple
• Retail Central Bank Digital Currency (or Retail CBDC or rCBDC)
• Wholesale Central Bank Digital Currency (or Whole CBDC or wCBDC)
• Atomic settlement

Web3

The Web3 theme contained approximately 21 keywords:

• Blockchain
• Cryptocurrency
• dApps (Decentralized Apps)
• DLT (Distributed Ledger Technology)
• Ledger
• Metaverse
• MiCA (Markets in Crypto-Assets Law)
• Regulation
• Smart contract
• Traditional Finance / TradFi
• Decentralized Exchange (DEX)
• Oracle
• Hyperledger
• Decentralized Autonomous Organizations (DAOs)
• Liquidity Pool
• Market Capitalization (Market Cap)
• Total Value Locked (TVL)
• Loss/bankruptcy/fraud/hack
• Decentralized Finance (DeFi)
• Interoperability/Interoperable/Bridge
• Flash Loans

Chinese Keywords

The English keywords were later translated to Simplified Chinese and Traditional Chinese to facilitate collection and analysis of FinTech-related data authored in Chinese and likely originating from individuals and media sources closer to the Chinese markets (for example, Singapore). Initial translations were made by Syntasa using the Google Translate service. These initial results were refined by WBG personnel who are fluent in written Chinese, and familiar with relevant cultural references related to smart finance.
Asset Tokenization (Simplified Chinese)

Asset Tokenization keywords in Simplified Chinese, shown with multiple synonyms separated by “/”:

资产代币化, 比特币, 世可/Circle/比特币Circle/比特币银行/比特币银行Circle, USDC, 冷钱包/硬件钱包/离线钱包, 热钱包/软件钱包/线上钱包, 加密寒冬/加密货币寒冬, 以太坊, 同质化代币/可替代代币/同质化通证/可替代通证, ICO/首次代币发行/首次发行代币/数字货币首次公开募资/数字货币首次公开发行/首次币发行, NFT/非同质化代币/非可替代代币/非同质化通证/非可替代通证/不可替代代币, 可编程货币/程序化货币, 可编程支付/程序化支付, Sats/Satoshis/中本聪, Tether/稳定币Tether, USDT/泰达币/稳定币USDT, 恒星币/XLM(Stellar)/恒星网络/XLM, STO/证券型通证发行/证券化通证发行, 数字资产平台, DAP/DAP币, Onyx/Onyx币, Orion/Orion币, 数字本票/数字期票, DFMI/数字金融市场基础设施, 碳币

Digital Currency (Simplified Chinese)

Digital Currency keywords in Simplified Chinese, shown with multiple synonyms separated by “/”:

数字货币, 采用, 苹果支付, CBDC/中央银行数字货币/央行数字货币, DCEP/数字货币电子支付/数字货币和电子支付工具/"DC/EP", 数字人民币/e-CNY, 货银对付/DVP/券款对付, 数字资产, 数字美元, 数字欧元, 电子数字钱包/数字钱包, 双重支付/重复花费/双花, 法定货币, 普惠金融/金融包容性, 错失恐惧症/FOMO/害怕错过/社交控, 谷歌支付/Google Pay, 即时付款, Meta pay/脸书支付, 公私合作制/公共私营合作制/政府和社会资本合作模式/公私伙伴关系/PPP, 零售央行数字货币/零售CBDC/零售中央银行数字货币/零售型央行数字货币/零售型CBDC/rCBDC, 稳定币, 批发央行数字货币/批发CBDC/批发中央银行数字货币/批发央行数字货币/批发型CBDC/wCBDC, 瑞波币, 原子清算/原子结算

Web3 (Simplified Chinese)

Web3 keywords in Simplified Chinese, shown with multiple synonyms separated by “/”:

web3, 区块链, 加密货币/密码货币/加密数字货币/虚拟货币, dApp/去中心化应用程序/分布式应用程序/去中心化应用/分布式应用, DLT/分布式帐本技术/分布式记账技术/分布式记账方式, 分布式帐本, 分类帐/分类账簿, 元宇宙, 欧盟加密资产市场监管法案/加密货币监管协议/MiCA, 监管, 智能合约, 传统金融/TradFi, 去中心化交易所, 价值中介, Hyperledger/超级账本, DAO/去中心化组织/去中心化自治组织, 流动性池/流动资金池/流动性储备资金, 市值, TVL/总锁定价值/锁定的总价值, 损失, 破产, 欺诈, 黑客, 去中心化金融/分布式金融/DeFi, 互操作性, 可互操作, Bridge/区块链桥, Interoperab, 闪电贷/Flash Loan

Geographical Locations

WBG provided a list of 34 individual and collective countries of interest grouped into six geographic regions:

• North America (US, Canada, Mexico, Bahamas, and Caribbean)
• South America (Brazil, Ecuador, and Colombia)
• Europe (European Union, Euro Area, European Economic Area, Ukraine, and Russia)
• Middle East and North Africa (MENA: UAE, Saudi Arabia, Qatar, Israel, Turkey)
• Africa (Central African Republic, Democratic Republic of the Congo, Ghana, and South Africa)
• Asia (China, Hong Kong, India, Kazakhstan, Singapore, South Korea, Taiwan, Thailand, Japan, Australia, New Zealand, and Vietnam)

Trusted Domains

The WBG provided a list of 82 organizations of prioritized interest relating to the predefined themes and keywords, accompanied by their website address (domain) and grouped into categories by organization type.

Organization Categories

Central Bank, Consultancy, Digital Currency Institution, News Sources, Financial Services, International Development, Regulatory Body, Research Center, Technology Company, and Think Tank.

These organizations represent a combination of authority figures, key players, and news sources participating in the many facets of finance. They are considered by WBG to be generally reliable, authentic, and trustworthy sources of information that is highly relevant to WBG business interests. As such, the collection was labeled Trusted Domains, referring to their website domain, for the duration of the project. Notably absent are social media platforms, including Twitter and Facebook.

APPENDIX C
Brandwatch

Brandwatch Social Media Listening Platform

Brandwatch is a social media listening and analytics platform that provides access to a wide range of online data sources including websites, social media platforms, and news. Brandwatch automates the process of capturing data from various sources. The platform uses web crawlers to continuously gather data from millions of websites, including blogs, forums, and news sites. It gathers news articles from thousands of sources, including major news outlets, blogs, and online publications. Users also have access to data from all of the major social media platforms (Facebook, Twitter, Instagram, LinkedIn, YouTube, and Reddit).
Brandwatch’s query feature is used to build complex queries that retrieve data meeting specific criteria, using key terms of interest in SQL-like queries to retrieve relevant data such as mentions of a brand or product, competitor activities, and industry trends. Some of the key capabilities of Brandwatch’s query feature include:

Advanced filtering

A wide range of filtering options may be used in a query, allowing users to narrow down their search results to only the data that is relevant to their research. Filters can be applied based on a variety of criteria, including time period, language, country, author, source type, and more. These can also filter out irrelevant data, reducing the amount of noise in the dataset.

Boolean operators

Queries also support Boolean and proximity operators, such as AND, OR, NOT, and NEAR. This enables users to create complex search queries that combine multiple search terms and filter criteria.

Although Brandwatch also provides a range of analytics and visualization tools, these capabilities are limited in comparison to those that are easily achievable using Syntasa and Google Cloud. Through the Brandwatch API, we are able to take advantage of this automated data capture, with comprehensive coverage provided in near real time; this can save time and effort compared to manually scraping data from these sources.

Brandwatch’s mention metadata fields provide a rich set of information that can be used to filter, analyze, and visualize social media and online content. Here are some of the metadata fields that are available in Brandwatch and commonly used in Syntasa’s news and social media narrative solutions.

Snippet

The excerpt of the mention that best matches the query.

Page Type

Describes the kind of website the mention was found on in a more human-readable way.
For example: “Blogs,” “YouTube,” “Dark Web,” “QQ,” “Facebook,” “Tumblr,” “Instagram,” “Forums,” “Twitter,” “VK,” “Review,” “Sina Weibo,” “Reddit,” “4Chan,” “LexisNexis Licensed News,” “News.”

Impact

Impact is a Brandwatch metric used to measure the potential impact of an author, site, or mention. It uses a logarithmic scale from 0 to 100, normalized for the users’ data to help them find what is most interesting for them. The impact score takes into account how much potential a mention has to be seen, as well as how many times it has been viewed, shared, or retweeted. (A decimal from 0 to 100.)

Reach Estimate

Reach Estimate is a score created by Brandwatch to estimate how many individuals may have seen a piece of content. It is available for multiple data sources, and enables the user to compare the reach of content from different platforms and track development over time. (0, or a positive integer.)

Sentiment

Each mention within a query has a sentiment associated with it. The sentiment of a mention can be positive, negative, or neutral. Sentiment is assigned automatically by the system, but can be set manually if required. Brandwatch’s sentiment analysis is based on AI research in the fields of Deep Learning and Natural Language Processing (NLP). Transformer-architecture language models are pretrained on billions of words to develop a deep general knowledge of over 100 languages before being applied to sentiment analysis. This offers a more sophisticated understanding of context, slang, and dialects. These models can detect sentiment indicated by:

• Words (including misspelled words), phrases, and sentence structure
• Emojis, emoticons, and multiword hashtags
• Negation, punctuation, and much more.
APPENDIX D
SmartFi – Trusted Domains Technical Details

A Brandwatch query was constructed using the three themes and associated English keywords mentioned in Reference Data. The location filter was set to “worldwide” to enable later geographic analysis in the dashboard. Since the keywords were in English, and to avoid the need for additional translation in the Syntasa app, the language filter was set to English to ensure that only English content is searched and returned. Pluralized and wild-card variants of the keywords were included in the query. The “NEAR” operator was used to reduce noise created by generic keywords by helping to ensure that their presence in the mention occurred alongside other themes and keywords of interest. (For more details on Brandwatch features and data sources see Appendix C.)

As the title suggests, the SmartFi - Trusted Domains exploration was primarily focused on the WBG list of 82 organizations of prioritized interest relating to the predefined themes and keywords. As such, advanced filtering was applied in the Brandwatch query to include only the results from those organizations’ website domains (Trusted Domains).

The final SmartFi - Trusted Domains dataset in Brandwatch consists of approximately 1M mentions in English from approximately 274k unique authors found worldwide across the 82 Trusted Domains, from January 1, 2018 through February 28, 2023. A “mention” refers to a specific instance of a keyword being mentioned on social media, news sites, blogs, forums, or any other online source that Brandwatch monitors. A mention can be in a tweet, a Facebook post, a blog post, a news article, a forum thread, or any other piece of content that contains the specified keyword.
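The query construction described at the start of this appendix (specific keywords with pluralized and wild-card variants, NEAR for generic keywords, and a restriction to the Trusted Domains) can be sketched as a small string builder. This is only an illustration, not the query actually used: the `NEAR/5` distance, the `site:` operator spelling, the trailing `*` wild-card form, the sample keywords and the placeholder domains are all assumptions, and `build_query` is a hypothetical helper name.

```python
def quote(term: str) -> str:
    """Wrap multi-word terms in double quotes so they match as phrases."""
    return f'"{term}"' if " " in term else term


def build_query(specific, generic, context, domains, distance=5):
    """Assemble a Boolean query string (operator syntax is assumed).

    specific: keywords precise enough to match on their own
    generic:  noisy keywords that must appear near a context term
    context:  terms that anchor the generic keywords
    domains:  website domains to restrict results to
    """
    # Specific keywords are simply OR-ed together.
    parts = [quote(k) for k in specific]

    # Generic keywords only count when they occur close to a context term.
    ctx = " OR ".join(quote(c) for c in context)
    parts += [f"({quote(g)} NEAR/{distance} ({ctx}))" for g in generic]

    keyword_clause = " OR ".join(parts)
    # Restrict the results to the listed domains only.
    site_clause = " OR ".join(f"site:{d}" for d in domains)
    return f"({keyword_clause}) AND ({site_clause})"


query = build_query(
    specific=["stablecoin*", "digital wallet", "digital wallets"],
    generic=["adoption"],
    context=["CBDC", "blockchain"],
    domains=["centralbank.example", "news.example"],  # placeholder domains
)
print(query)
```

Keeping generic terms such as "adoption" behind a proximity clause is what limits noise: the term matches only when it appears close to a keyword that is unambiguous on its own.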
Syntasa SmartFi - Trusted Domains App

Ingest

Brandwatch dataset

The ready-made Brandwatch API process included with Syntasa was configured with the Brandwatch Trusted Domains query ID to ingest the Brandwatch Trusted Domains query dataset into BigQuery at a 100 percent sample rate via Brandwatch’s commercial API. Each mention in the Brandwatch Trusted Domains dataset contains up to 103 associated mention metadata fields depending on the source, type, and data availability of the mention. These fields include date, author, domain, page type, sentiment, impact, reach, snippet, and geographical information (when available).

In addition to the Brandwatch data, the reference data in Appendix B were also ingested. The reference data were first manually copied into a single Google Sheet with three tabs: Themes and Keywords, Trusted Domains, and Regions and Countries. Three Spark processors, one for each tab, use Python code to access the relevant tab via Google Cloud Storage API and insert its contents into a BigQuery table.

Process

Noise Filter

Visual analysis in the SmartFi Trusted Domains - Narrative Looker dashboard of the most recent 30 days of mentions revealed a high number of irrelevant forum and review mentions originating from the trusted domains that were categorized as Technology Company. These mentions included knowledge-base articles, technical support forums, and app store reviews from Amazon, Microsoft, Google, and Apple.

Syntasa provides a multitude of ways to implement noise mitigation in the pipeline, including predefined processes with filtering parameters, or the option to define custom scripts and SQL queries.
For demonstration in this POC, a Transform process was inserted in the app and an SQL “WHERE” clause was added to the filters to exclude the aforementioned mentions:

WHERE ((categories.Category != "Technology Company") AND ((data.pageType != "forum") AND (data.pageType != "review")))

Themes

A BigQuery (BQ) process was used to label themes associated with each mention based on matching keywords. Referencing the predefined Themes and Keywords, a mention was labeled Asset Tokenization, Digital Currency, or Web3 if the snippet contained at least one keyword associated with one of these themes.

Categories

An organization category was assigned to each mention using the “Join” feature of the same Transform process containing the noise filtration mentioned above. Mentions were labeled with one of the ten Organization Categories, based on a matching originating domain.

Regions

A geographic region was assigned to mentions that have an associated country provided by Brandwatch. Although this could easily be done using a process in the app, for demonstration purposes this was implemented in Looker using LookML. Similarly to how organization categories are assigned, the LookML references the Geographical Locations to assign one of the six predefined regions, based on a matching originating country.

Topic Modeler, Phrases, and Companion Phrases

A ready-made Topic Modeler process is used in the app to identify topics, phrases, and companion phrases. This process consists of Python code running in a Spark processor that applies AI and NLP to analyze each mention, identify recurring phrases, and categorize them into topics. For example, the phrases “bitcoin,” “btc,” and “ethereum” might be categorized under the topic “cryptocurrency.” (See Topic Modeling for additional details on Syntasa’s implementation.)
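The noise filter described earlier in this Process section can be read as a row-level predicate. The sketch below uses plain Python rather than the SQL used in the app; it mirrors the clause as printed and contrasts it with a narrower, hypothetical variant. As printed, the clause keeps a mention only when it is neither from a Technology Company domain nor a forum or review page, so it also drops, for example, news pages from Technology Company domains. The narrower variant drops only forum and review mentions from Technology Company domains. The report does not state which behavior was intended; the function names and sample rows are assumptions for illustration.

```python
NOISY_PAGE_TYPES = {"forum", "review"}


def keep_as_printed(category: str, page_type: str) -> bool:
    """Mirror of the printed WHERE clause: both conditions must hold."""
    return category != "Technology Company" and page_type not in NOISY_PAGE_TYPES


def keep_narrow(category: str, page_type: str) -> bool:
    """Hypothetical variant: drop only Technology Company forum/review rows."""
    return not (category == "Technology Company" and page_type in NOISY_PAGE_TYPES)


# Illustrative (category, page type) pairs, not real data.
rows = [
    ("Technology Company", "forum"),
    ("Technology Company", "news"),
    ("Central Bank", "news"),
    ("News Sources", "forum"),
]

kept_printed = [r for r in rows if keep_as_printed(*r)]
kept_narrow = [r for r in rows if keep_narrow(*r)]

print(kept_printed)  # only the Central Bank news row survives
print(kept_narrow)   # everything except the Technology Company forum row
```

Comparing the two outputs on a small sample like this is a quick way to confirm that a filter removes only the mentions it was meant to remove before applying it to the full dataset.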
The snippet is first cleansed using regular expressions to ensure the snippets processed by the topic modeler consist of only alphanumeric characters and spaces. Given that the Trusted Domains sources do not include social media, the parameter to include hashtags for analysis was disabled. Mentions with short, nonsensical, and/or unrelated text are automatically discarded by the topic modeler.

As with the observations discussed regarding the noise filter, visual analysis in the SmartFi Trusted Domains - Narrative Looker dashboard of the most recent 30 days of mentions revealed several irrelevant and/or undesirable topics. A series of “stop” words were provided to the topic modeler to suppress these: the, this, an, that, do, these, is, has, have, was, had, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z, continue reading, also read, rights reserved, privacy policy, not be, total views, use cookies.

The resulting BigQuery table is an expanded view consisting of a row for every unique combination of a topic, phrase, and/or companion phrase associated with a particular snippet.

Combine

Finally, to facilitate analysis in a Looker dashboard, an SQL query in a BQ process is used to join the intermediary tables containing the Brandwatch data, themes, categories, regions, topics, phrases, and companion phrases into a single, unified table. The unique mention resource ID is referenced in the LookML to collapse the expanded dataset so that each mention and its associated metadata are counted only once during the dashboard analysis.

Activate

Initially, one week of Brandwatch data was ingested; processed to ensure that the pipeline was operating properly; and analyzed in the Looker dashboard to identify data quality issues such as sources of noise.
After updating the noise filter and stop words, the process was repeated for the most recent 30 days, and then expanded even further to incorporate mentions from the last five years (January 1, 2018 to current day) for historical analysis. The last step taken was to enable a scheduled job to automatically ingest and process new Brandwatch data once a day to allow continued analysis moving forward.

APPENDIX E
SmartFi – Uncertain Domains Technical Details

For the SmartFi - Uncertain Domains solution, the SmartFi - Trusted Domains Brandwatch query was modified (see Appendix D). The same keywords, location filter, and language were used. Data sources include social media (Twitter, Facebook, Reddit, Tumblr, YouTube), blogs, forums, and news websites. However, unlike the SmartFi - Trusted Domains query, which focused exclusively on the Trusted Domains, the advanced filtering in the SmartFi - Uncertain Domains query was modified to explicitly exclude results from Trusted Domains.

The final SmartFi - Uncertain Domains dataset in Brandwatch consists of about 83M mentions in English from about 6M unique authors found worldwide from December 1, 2022 through February 28, 2023.

Ingest

Brandwatch Dataset

The ready-made Brandwatch API process included with Syntasa was configured with the Brandwatch Uncertain Domains query ID to ingest the dataset into BigQuery at a sample rate of approximately 1.85 percent (the maximum Brandwatch provided given the dataset volume) via Brandwatch’s commercial API. The metadata fields remain the same as described in the SmartFi - Trusted Domains app.

Themes

In addition to the Brandwatch data, the Themes and Keywords were also ingested, as described in the Trusted Domains app.
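Once the Themes and Keywords are ingested, theme labeling (done in the apps with a BigQuery process) amounts to matching keywords against each snippet. The following is a minimal Python sketch of that idea, not the BigQuery logic itself. The case-insensitive whole-word matching, the possibility of a mention receiving more than one theme, and the small keyword subset taken from Appendix B are assumptions made for illustration.

```python
import re

# A small subset of the Appendix B keywords, for illustration only.
THEME_KEYWORDS = {
    "Asset Tokenization": ["Bitcoin", "Ethereum", "Digital Promissory Note"],
    "Digital Currency": ["Stablecoin", "Digital Wallet", "CBDC"],
    "Web3": ["Blockchain", "Smart contract", "Metaverse"],
}


def compile_theme(keywords):
    """Build one case-insensitive pattern matching any keyword as a whole word."""
    alternatives = "|".join(re.escape(k) for k in keywords)
    return re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)


THEME_PATTERNS = {theme: compile_theme(kws) for theme, kws in THEME_KEYWORDS.items()}


def label_themes(snippet: str) -> list:
    """Return every theme with at least one keyword present in the snippet."""
    return [theme for theme, pattern in THEME_PATTERNS.items() if pattern.search(snippet)]


labels = label_themes("The central bank piloted a CBDC on a permissioned blockchain.")
print(labels)
print(label_themes("Quarterly earnings rose."))  # no theme keywords present
```

Whole-word matching avoids labeling a snippet because a keyword happens to occur inside a longer word, at the cost of missing plural forms unless those are listed as keywords too.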
Twitter

Full tweet text was retrieved directly from the Twitter API for all tweet IDs included in the Brandwatch dataset through a Spark processor with custom Python code that leverages off-the-shelf libraries such as Requests, Pandas, and JSON. The tweet text is then inserted into the Brandwatch dataset as the mention snippet in a second Spark processor.

Process

Themes

As with the SmartFi - Trusted Domains app, the same BQ process was used to label themes associated with each mention based on matching keywords.

Topic Modeler, Phrases, and Companion Phrases

Visual analysis in the SmartFi Uncertain Domains - Narrative Looker dashboard of the most recent 30 days of mentions revealed several irrelevant and/or undesirable topics. No noise filter was implemented in the app. However, a series of stop words were provided to the topic modeler to suppress these: the, this, an, that, do, these, im, is, has, have, was, had, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z, amp, rt, follow, retweet, tweet, quote, comment, the, a, this, an, that, do, these, im, i, is, has, have, was, had, huh, th, else, did, http, https

Combine

Finally, to facilitate analysis in a Looker dashboard, an SQL query in a BQ process is used to join the intermediary tables containing the Brandwatch data, themes, topics, phrases, and companion phrases into a single unified table. The unique mention resource ID is referenced in the LookML to collapse the expanded dataset so that each mention and its associated metadata are counted only once during dashboard analysis.

Activate

Initially, one day of Brandwatch data was ingested, processed, and analyzed in the Looker dashboard to ensure that the pipeline was operating properly and to identify sources of noise.
After updating the noise filter and stop words, the process was repeated for the most recent seven days and then expanded even further to incorporate mentions from December 1, 2022 to the current day for historical analysis. As with the SmartFi - Trusted Domains app, the last step taken was to enable a scheduled job to automatically ingest and process new Brandwatch data once a day to allow continued analysis moving forward.

APPENDIX F
SmartFi – Chinese Language Technical Details

For the SmartFi - Chinese Language solution, the SmartFi - Uncertain Domains Brandwatch query (itself derived from the SmartFi - Trusted Domains query) was modified. The same location filter, Worldwide, was used. However, the language was limited to Chinese and the Simplified Chinese keywords were used in place of the English terms. Again, data sources include social media (Twitter, Facebook, Reddit, Tumblr, YouTube), blogs, forums, and news websites. Trusted Domains were not excluded.

The final SmartFi - Chinese Language app dataset in Brandwatch consists of approximately 69K mentions in Chinese from approximately 17K unique authors found worldwide on February 7, 2023.

Ingest

Brandwatch Dataset

The ready-made Brandwatch API process included with Syntasa was configured with the Brandwatch Chinese Language query ID to ingest the dataset into BigQuery at a sample rate of approximately 37.5 percent (the maximum Brandwatch provided given the dataset volume) via Brandwatch’s commercial API. The metadata fields remain the same as described in the SmartFi - Trusted Domains app.

Themes

Themes and Keywords were also ingested as described in the Trusted Domains app.

Twitter

As with the SmartFi - Uncertain Domains app, the full tweet text was retrieved directly from the Twitter API for all tweet IDs included in the Brandwatch data, and inserted into the Brandwatch dataset as the mention snippet.
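The snippet replacement step can be sketched without calling any external service. The example below is a minimal standard-library illustration, not the Spark processor code: it assumes the full tweet texts have already been retrieved into a dictionary keyed by tweet ID, and the field names `tweet_id` and `snippet`, the helper names, and the sample rows are hypothetical. A small batching helper is included because lookups of many IDs are usually issued in groups; the batch size used here is arbitrary.

```python
def chunked(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def apply_full_text(mentions, full_text_by_id):
    """Return copies of the mentions, replacing the snippet with the
    full tweet text wherever one was retrieved for that tweet ID."""
    updated = []
    for mention in mentions:
        row = dict(mention)  # copy so the input rows stay untouched
        text = full_text_by_id.get(row.get("tweet_id"))
        if text:
            row["snippet"] = text
        updated.append(row)
    return updated


# Illustrative rows: one tweet with a truncated snippet, one non-tweet mention.
mentions = [
    {"tweet_id": "101", "snippet": "数字人民币试点…"},
    {"tweet_id": None, "snippet": "区块链 news article excerpt"},
]
# Assumed to have been fetched beforehand (retrieval itself is not shown).
full_text_by_id = {"101": "数字人民币试点扩大到更多城市"}

updated = apply_full_text(mentions, full_text_by_id)
batches = chunked(["101", "102", "103", "104", "105"], 2)
print(updated[0]["snippet"])
print(batches)
```

Mentions without a tweet ID, or whose text could not be retrieved, keep their original Brandwatch snippet, so the step never removes data.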
Process

Themes, Topics, Phrases, and Companion Phrases

Processing for themes, topics, phrases, and companion phrases was the same as in the SmartFi - Uncertain Domains app.

Translation

To facilitate theme assignment and topic modeling, the snippet text was translated into English using a ready-made Translate process, which uses a pre-trained Opus-MT model available for download on Hugging Face (https://huggingface.co/Helsinki-NLP/opus-mt-zh-en). See Chinese Translation for additional details on Syntasa’s Chinese-to-English translation implementation.

Combine

To facilitate analysis in a Looker dashboard, the SQL query used to join the intermediary tables in the SmartFi - Uncertain Domains app was modified to include both the original Chinese snippet and the snippet translated into English.

Activate

Only one day (February 7, 2023) of Brandwatch data was ingested, processed, and analyzed in the Looker dashboard, to ensure that the pipeline was operating properly and to allow for proper evaluation.
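The snippet cleansing and stop-word suppression that precede topic modeling in the apps described in these appendices can be illustrated with a short Python sketch. It is not Syntasa's implementation: the exact regular expression, the helper names, and the sample text are assumptions, and only a subset of the stop words listed in Appendices D and E is used. An ASCII-only pattern like the one below would strip Chinese characters entirely, so in this sketch it is meaningful only for English or translated text.

```python
import re
import string

# Subset of the stop words from the appendices, plus the single letters a to z.
STOP_PHRASES = {
    "the", "this", "an", "that", "is", "http", "https",
    "continue reading", "rights reserved", "privacy policy",
} | set(string.ascii_lowercase)

NON_ALNUM = re.compile(r"[^A-Za-z0-9 ]+")  # anything but letters, digits, spaces
MULTISPACE = re.compile(r" {2,}")


def cleanse(snippet: str) -> str:
    """Keep only alphanumeric characters and single spaces."""
    text = NON_ALNUM.sub(" ", snippet)
    return MULTISPACE.sub(" ", text).strip()


def suppress(phrases):
    """Drop candidate phrases that appear in the stop list (case-insensitive)."""
    return [p for p in phrases if p.lower() not in STOP_PHRASES]


cleaned = cleanse("Bitcoin's price rose 5%!\nContinue reading → https://example.com")
kept = suppress(["bitcoin", "Continue reading", "s", "price", "https"])
print(cleaned)
print(kept)
```

In this sketch, stripping punctuation leaves fragments such as a stray "s" from "Bitcoin's" and "https" from links, which may be one reason single letters and URL tokens appear in the stop-word lists.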