Project SmartFi: Exploring AI/ML for FinTech News
In collaboration with Syntasa, powered by Google Cloud




ABSTRACT

The World Bank Finance and Technology Department, in collaboration with
the World Bank Technology and Innovation Lab, partnered with Google Cloud
and Syntasa Inc. to learn how artificial intelligence and machine learning
could enhance news sourcing and sentiment analysis for FinTech topics globally.
This outcome report shares the key learnings and insights gained during the
exploration and development of a prototype.




ACKNOWLEDGEMENTS

The key learnings outlined in this report were prepared by the Project SmartFi
(Smart Finance) team.
World Bank Treasury Finance and Technology (TREFT): Paul Snaith, Patrick
Cheng, Jaskaran Singh
World Bank Technology and Innovation Lab (ITSTI): Yusuf Karacaoglu, Stela
Mocan, Mora Farhad, Mahesh Chandrahas Karajgi, Oleksandra Postavnicha,
Yujuan Sun
World Bank Corporate Procurement: Sanjay Colaco, Shweta Mesipam
Syntasa Incorporated: Shawn Zargham, Michael Finn, Kyle Witt, James Wilson,
Eric Bugin, Kareem Sharaf, Ted Blake
Google Cloud: Ryan Wright, Rajat Gupta
Contents

Abbreviations and Acronyms

Section 1: Overview
   Executive Summary
   Project Background
   Project Team & Sponsor

Section 2: Exploration with Artificial Intelligence for Financial News
   Research Approach
   Business Challenge Scope

Section 3: Collaboration with Google Cloud and Syntasa
   Rapid Prototyping with Technology Partners
   Solution Overview and Key Results
   Technical Approach (Syntasa)

Section 4: Learning Outcomes and Future Considerations
   Technical Learnings for World Bank
   Business Learnings and Outcome

Appendix A: Narrative Dashboard Features
Appendix B: Reference Data
Appendix C: Brandwatch
Appendix D: SmartFi – Trusted Domains Technical Details
Appendix E: SmartFi – Uncertain Domains Technical Details
Appendix F: SmartFi – Chinese Language Technical Details

FIGURES AND TABLES

Table 2.1
Figure 3.1: Syntasa Solution
Figure 3.2: Modeled Mentions
Figure 3.3: Word Cloud
Figure 3.4: Domain Source
Figure 3.5: Domain and PDF Sourcing
Figure 3.6: Trending Topics
Figure 3.7: Sentiment Validation
Figure 3.8: Sentiment Model Explainability
Figure 3.9: Solution Architecture
Figure 3.10: Data and AI pipeline
Figure 3.11: Chinese Language App Configuration
Figure 3.12: Topic Modeling Parameters
Figure 3.13: Dashboard Trending Phrases
Figure 3.14: Sentiment Explainability
Figure 3.15: Sentiment Validation
Figure 3.16: Language Translation Performance
Figure 3.17: PDF Sourcing
Figure 3.18: Sentiment Explainability
Figure 3.19: Solution Architecture
Figure 4.1: Topic Modeling
Figure 4.2: Topic Modeling Explainer
Table 4.1: Sentiment Analysis Models




Abbreviations and Acronyms

Abbreviation   Description
AI             Artificial Intelligence
API            Application Programming Interface
App            Application
AWS            Amazon Web Services
BARD AI        Google’s Generative AI Tool
BERT           Bidirectional Encoder Representations from Transformers
BI             Business Intelligence
BQ             BigQuery
ChatGPT        OpenAI’s Generative AI Tool
DLP            Data Loss Prevention
ETL            Extract, Transform, Load
FedRAMP        Federal Risk and Authorization Management Program
FinTech        Financial Technology
FTX            Futures Exchange
GCP            Google Cloud Platform
IAM            Identity and Access Management
IoT            Internet of Things
ITSTI          World Bank Group Technology and Innovation Lab
JSON           JavaScript Object Notation
KPI            Key Performance Indicator
LDA            Latent Dirichlet Allocation
LLM            Large Language Model
LookML         Looker Modeling Language
ML             Machine Learning
NLP            Natural Language Processing
NMF            Non-negative Matrix Factorization
OCR            Optical Character Recognition
POC            Proof of Concept
PoV            Proof of Value
RoBERTa        Variant of BERT model
RPA            Robotic Process Automation
SaaS           Software as a Service
SmartFi        Smart Finance
SME            Subject Matter Expert
TI Lab         World Bank Technology and Innovation Lab
TRE            Treasury
TREFT          World Bank Treasury Financial Technology unit
UI             User Interface
VPC            Virtual Private Cloud

SECTION 1: OVERVIEW




  Executive Summary
  In today’s fast-paced world, it can be challenging to stay informed on the latest
  financial technology news and trends, which can help to inform decisions for
  financial and operational strategies. The amount of information and opinions
  available on the internet can be overwhelming, and it can be challenging to
  filter out what is most relevant and important for business users. Technology
  is constantly evolving; new trends and developments may emerge daily. To
  address this challenge, the World Bank Treasury Financial Technology unit
  (TREFT) and the World Bank Group Technology and Innovation Lab (ITSTI)
  (hereafter “project team”) worked on a framing exercise to explore how
  emerging technologies could give users access to curated, trusted, and
  relevant news sources and inform them of sentiment across trending topics.

  The ITSTI lab follows a structured approach using design thinking
  methodologies to understand the needs, wants, and pain-points of end users.
  The project team identified a sample list of the key topics and terms of interest;
  various trusted sources (including open source and subscription content, and
  social media channels); and the geographic areas of interest, to help guide the
  data requirements. The team also conducted market research to understand
  how similar problems are being solved, and to build on the in-lab knowledge.
  Throughout this research, we worked with Google Cloud, the cloud platform of
  the largest search provider. The Google Cloud Platform (GCP) provides a range
  of tools and services that are helpful in using machine learning to source
  news, for example the Cloud Natural Language API, which extracts entities,
  sentiment, and insights from news articles, among many other capabilities.
  We also worked with Google Cloud’s partner company, Syntasa Inc., which
  specializes in sentiment analytics, generating insights through data
  analytics, and understanding digital behaviors to customize solutions for
  business users.



    With Syntasa, which is powered by Google Cloud, we collaborated on designing
    and creating a prototype of a dashboard that provides users with the ability
    to gain insights into sentiment trends so that behavior shifts can be quickly
    identified by topic and by region. The visualization tool we created also
    provides flexibility in customizing filters, to enable quick access to digestible
    FinTech topics that can help users stay up to date with the latest trends
    and developments in their industries; identify new opportunities; and make
    informed decisions.

    Our collaboration provided the project team with the opportunity to not
    only explore potential solutions but also to learn from Syntasa how private
    technology firms blueprint and develop artificial intelligence (AI) and machine
    learning (ML) prototypes to scale into enterprise adoption. The World Bank
    Technology and Innovation Lab (TI Lab) technical team worked closely with
    Syntasa and Google Cloud to learn how data scientists build custom AI/ML
    models, and test them for accuracy and explainability regarding transparency,
    accountability, and compliance, and to ensure that AI systems are fair, ethical,
    and safe to use. This report outlines the technical learnings, value drivers, and
    capabilities of the solution we developed.




Project Background
The World Bank’s Treasury Operations, Financial Technology unit (TREFT)
helps lead the treasury’s technological advancement initiatives from the
ideation phase through development, and successful implementation in close
partnership with the treasury business units and technology developers.
TREFT actively engages with the Bank’s business units on identifying and
implementing suitable technical solutions for business use cases in treasury
operations, and their potential development and implementation through
in-house and/or off-the-shelf solutions. Such a process requires a constant
review of the Bank’s internal technology capabilities and comparison with
existing industry standards and new market developments. Consequently,
it is immensely important for TREFT to selectively monitor new technology
trends and solutions, and subsequently to determine their suitability for
the improvement of treasury operations. Currently, this process is being
largely performed manually, with a considerable amount of personnel time
and resources being dedicated to it on a regular basis. Some of the current
challenges include:

•	   Manual sourcing and consolidation of the most relevant and informative
     FinTech news and events is tedious.
•	   Keeping track of market discussions and public sentiment surrounding
     notable FinTech topics and events.
•	   Limited search scope in terms of news sources, given the time and
     resource constraints.
•	   Determining the authenticity of a news source, its thematic relevance, and
     potential topical categorization.

In order to tackle these challenges and to systematically harmonize the process
of FinTech and technology news sourcing, TREFT sees a unique opportunity
to explore an AI system that mimics human methods in order to quickly and
efficiently source curated news relevant to the topics of interest for a specific
business unit. A related opportunity comes with automating the process of
quantifying relevance, measuring sentiment, and determining the bias of news
after it has been sourced. This can be accomplished by mirroring human tactics
for measuring how relevant an article is, and determining its overall sentiment
and bias, a process which can also be supported through AI methods.




    Given these opportunities and the potential benefits of deploying such an
    AI solution to multiple use cases within treasury, TREFT, along with its
    partner, the TI Lab, explored in-house and off-the-shelf solutions which
    could fulfill the requirements of the use case.




    Project Team & Sponsor
    TREFT coordinates the efficient internal administration of the World Bank
    Treasury’s Information Technology infrastructure across all institutional
    projects, maintenance, and budget and planning cycles, ensuring that it
    remains fit for purpose, up-to-date, secure, and reliable. The unit also develops
    and maintains appropriate strategic technology planning in relation to
    Treasury’s significant standing in the global financial markets, and leverages
    that standing to build internal and external partnerships for market and
    development effect. TREFT’s technology initiatives include leading Treasury’s
    participation in large-scale system renewals and emerging technology projects
    in FinTech fields such as AI/ML, blockchain, RPA, and World Bank finance-
    wide projects.

    The TI Lab is a specialized unit within the World Bank Group’s Information and
    Technology vice presidency, centered around three main pillars: innovation,
    experimentation, and capacity building. TI Lab works closely with various
    departments and units within the World Bank Group, as well as with external
    partners, to identify potential areas where emerging technologies can be
    applied to solve business and development problems. It aims to assist
    World Bank Group (WBG) business teams in problem framing, requirement
    gathering, data preparation, technical guidance, and prototype delivery to
    help decision makers assess whether operationalizing a solution is worth the
    investment. The mandate of the TI Lab is to learn by doing and to share
    knowledge across teams, for continuous innovation.




SECTION 2: EXPLORATION WITH ARTIFICIAL INTELLIGENCE FOR FINANCIAL NEWS




  Research Approach
   1	 What are the most effective methods for collecting and curating news
      articles related to a specific topic or set of topics?

  2	 How accurate and reliable are existing sentiment analysis models for
     analyzing news articles, and what types of customizations or training are
     needed to improve their performance?

  3	 How do different sources of news articles (social media, traditional
     news outlets, blogs) vary in terms of their sentiment and relevance to
     specific topics?

  4	 What are the most effective methods for visualizing and presenting
     sentiment analysis results to users, and how can these be customized to
     meet the needs of different stakeholders?

  5	 How can sentiment analysis be used to identify trends and emerging topics
     in a specific industry or field, and what types of insights can be gained from
     this analysis?

  6	 What are the ethical and legal implications of using sentiment analysis to
     curate and analyze news articles, and how can these be addressed in the
     development and implementation of the solution?

  7	 How do different user groups (analysts, executives, investors) use curated
     news and sentiment analysis, and what other features and functionalities
     can be important to these users?




Business Challenge Scope
The scope of the POC was determined by the project team in collaboration with
Syntasa. Foundational data and base material were provided as inputs to the
Syntasa team, as detailed below.

Relevant topics of interest to TREFT business operations were provided to
Syntasa in the form of an Excel document with the following structure.
Major themes were developed, and various subtopics were categorized under
these themes, forming the pool of relevant FinTech and technology-related
keywords. To provide additional filter mechanisms and take into account the
geographical relevance of the topics, a list of geographic locations and
regions was also provided; combined with the theme subtopics, it yields more
specific and relevant search results. A brief example of the structure of the
inputs can be seen in Table 2.1, and a detailed overview is provided in
Appendix B.



TABLE 2.1

Theme       Asset Tokenization                Digital Currency                        Web3

Keywords    Fungible tokens                   CBDC (Central Bank Digital Currency)    Blockchain
            ICO (Initial Coin Offering)       Delivery versus Payment (DvP)           Cryptocurrency
            NFT (Non-Fungible Tokens)         Digital Assets                          DApps (Decentralized Apps)
            Programmable Money                Digital Wallet                          DLT (Distributed Ledger Technology)
            Programmable Payments             Stablecoin                              Decentralized Autonomous Organizations (DAOs)
            Carbon tokenization               FOMO (Fear of Missing Out)              Decentralized Finance (DeFi)
            Security Tokens Offering (STO)    Instant Payment                         Interoperability

Filters     List of Regions: North America, South America, Europe, MENA, Asia, etc.
            List of Domains: federalreserve.gov, ecb.europa.eu, bankofcanada.ca, mas.gov.sg, imf.org, etc.
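
To make the input structure more concrete, the following is a minimal sketch of how themes, keywords, regions, and trusted domains of the kind shown in Table 2.1 could be combined into search queries. It is an illustration only and not the method used in the prototype: the function name `build_queries`, the dictionary layout, and the `site:` query syntax are all assumptions, and the lists are a small subset of the table.

```python
from itertools import product

# Illustrative subset of the Table 2.1 inputs (the full lists are in Appendix B).
THEMES = {
    "Asset Tokenization": ["Fungible tokens", "NFT (Non-Fungible Tokens)"],
    "Digital Currency": ["CBDC (Central Bank Digital Currency)", "Stablecoin"],
    "Web3": ["Blockchain", "Decentralized Finance (DeFi)"],
}
REGIONS = ["North America", "Europe", "Asia"]
DOMAINS = ["federalreserve.gov", "ecb.europa.eu", "imf.org"]


def build_queries(themes, regions, domains):
    """Pair every keyword with every region and restrict results to the given domains.

    The `site:` operator is an assumed query syntax; a real search backend
    may expose domain filters differently.
    """
    site_filter = " OR ".join(f"site:{domain}" for domain in domains)
    queries = []
    for theme, keywords in themes.items():
        for keyword, region in product(keywords, regions):
            queries.append(
                {
                    "theme": theme,
                    "keyword": keyword,
                    "region": region,
                    "query": f'"{keyword}" "{region}" ({site_filter})',
                }
            )
    return queries


queries = build_queries(THEMES, REGIONS, DOMAINS)
for entry in queries[:3]:
    print(entry["theme"], "|", entry["query"])
```

Keeping the theme alongside each query means that every sourced article can later be grouped by theme, keyword, and region in a dashboard.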




Value Proposition

The following are value-drivers for the proposed solution:

•	   Stay informed on industry trends and news: Allows users to stay up-to-
     date on the latest news and developments in the finance and technology
     industries, including emerging trends and topics.
•	   Gain insights into sentiment trends: Allows users to quickly identify shifts in
     sentiment towards specific topics or companies, providing valuable insights
     into market trends and sentiment.
•	   Monitor partners: Users could track news and sentiment around member
     countries, NGOs, commercial banks, and other partners, enabling them to
     stay informed on their actions and strategies.
•	   Make data-driven decisions: Accurate and reliable sentiment analysis on
     desired topics helps users make data-driven decisions based on real-time
     insights.
•	   Save time and resources: Users can save time and resources that would
     otherwise be spent searching for and analyzing news articles manually.

Capabilities that could be included in the dashboard to support these value
drivers include:

•	   Customizable news feeds: Users could customize their news feeds to only
     show news articles related to specific topics or keywords, ensuring that they
     only see relevant content.
•	   Sentiment analysis: Users could filter by sentiment on specified topics
     or across geographies to understand how different regions or industries
     react to FinTech.
•	   Real-time updates: Users could adjust the time horizon to understand how
     FinTech topics have evolved over time, or receive alerts in real time.
•	   Customizable alerts: Users could set up alerts to notify them of changes in
     sentiment or news related to specific topics or companies, enabling them to
     stay informed without constantly monitoring the dashboard.
•	   Integration with other tools: The dashboard could be integrated with other
     tools, such as trading platforms or financial analysis tools, allowing users
     to make data-driven decisions directly from the dashboard. Generative AI
     could also be integrated in the future.
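
As an illustration of the customizable feed and alert capabilities listed above, the sketch below filters a list of articles by topic and region and flags a shift in average sentiment between two periods. It is not taken from the prototype: the `Article` fields, the sentiment scale of -1.0 to 1.0, the 0.3 alert threshold, and the sample headlines are all assumptions made for the example.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Article:
    title: str
    topic: str
    region: str
    sentiment: float  # assumed scale: -1.0 (negative) to 1.0 (positive)


def filter_feed(articles, topics=None, regions=None):
    """Keep only the articles matching the selected topics and regions."""
    return [
        a for a in articles
        if (topics is None or a.topic in topics)
        and (regions is None or a.region in regions)
    ]


def sentiment_shift(previous, current):
    """Change in mean sentiment between two periods, or None if either is empty."""
    if not previous or not current:
        return None
    return mean(a.sentiment for a in current) - mean(a.sentiment for a in previous)


def should_alert(previous, current, threshold=0.3):
    """Alert when mean sentiment moves by at least `threshold` in either direction."""
    shift = sentiment_shift(previous, current)
    return shift is not None and abs(shift) >= threshold


# Invented sample data, for illustration only.
last_week = [
    Article("CBDC pilot expands", "CBDC", "Asia", 0.6),
    Article("Stablecoin rules welcomed", "Stablecoin", "Europe", 0.4),
]
this_week = [
    Article("CBDC pilot delayed", "CBDC", "Asia", -0.2),
    Article("Stablecoin issuer fined", "Stablecoin", "Europe", -0.5),
]

print(filter_feed(this_week, topics={"CBDC"}, regions={"Asia"}))
print(should_alert(last_week, this_week))
```

A threshold on the change in mean sentiment is only one possible alert rule; a production dashboard would likely let each user choose the rule and the time window.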




    By incorporating these value drivers and capabilities, a dashboard that shows
    finance and technology-related news with sentiment analysis could provide
    valuable insights and result in time savings for its users.




SECTION 3: COLLABORATION WITH GOOGLE CLOUD AND SYNTASA




  Rapid Prototyping with Technology Partners


  About Syntasa

  Syntasa is a cloud-based data and AI platform that enables users to connect
  various data sources, build and deploy customized AI/ML models, and activate
  them across various channels through dashboards, data shares, and APIs.
  This tool provides users with visibility into the full data pipeline, including
  data source, dependencies, and how the data is being used to drive insights.
  The Syntasa platform is built with leading open-source technologies, and is
  powered by GCP services.

  The Syntasa platform uses the concept of apps (along with the sequencing
  of those apps) to accelerate time-to-value; improve reliability and efficiency;
  and provide significant return on investment over home-grown cloud-based
  solutions. The apps range from low-code or no-code to full-code, which allows
  business users, analysts, data scientists, and data engineers to collaborate,
  and to leverage and share their expertise.

  The Syntasa platform runs natively in an organization’s GCP environment, with
  the data stored in Google Cloud Storage and BigQuery. Organizations can keep
  their sensitive data inside their virtual private cloud (VPC) and behind their
  firewall, thus maintaining full control, while leveraging the advances in big
  data processing and AI/ML that are provided by Syntasa and Google Cloud
  services.


     FIGURE 3.1: Syntasa Solution




Syntasa’s capabilities make it a powerful tool for rapid prototyping, enabling
users to quickly iterate and refine prototypes based on real-time data and
insights. Benefits include:

•	   Rapid prototyping from low-code drag-and-drop interface and
     full-code interface
•	   Native support for GCP infrastructure
•	   Apache Spark and Kubernetes runtime support
•	   Templatized integrations, processes, and apps to enable
     consistency and code reuse
•	   Scalable data and AI app framework for development and production
•	   Integrated production data + feature + activation pipelines
•	   Collaboration, version control, and an automated
     documentation framework
•	   Advanced job definition, scheduling, and management
     capabilities with job failure alerts
•	   Data quality monitoring with visibility into data provenance
     and lineage
•	   Business alerting and model performance monitoring




About the Google Cloud Platform

GCP is a suite of cloud computing services offered by Google. It runs on the
same infrastructure that Google uses internally for its end-user products, such
as Google Search, Gmail, Google Drive, and YouTube. GCP offers a scalable range
of services, including compute, networking, storage, big data, security and
identity management, management tools, cloud AI, the Internet of Things (IoT),
and more. Some examples of GCP services are Compute Engine, App Engine,
Kubernetes Engine, Cloud Functions, Cloud Run, Cloud Storage, Cloud SQL,
BigQuery, Cloud Pub/Sub, and TensorFlow services.




                                      Global Network

     Google Cloud has a worldwide presence. Google’s global network, connected
     via high-speed cables, moves data across the globe in a highly performant
     and secure manner. Google Cloud offers FedRAMP Moderate cloud services in
     Google Cloud data centers around the world, which gives organizations the
     ability to move data securely and compliantly from one part of the world
     to another in order to meet key objectives such as data backup
     requirements.



                                         BigQuery

     BigQuery is Google Cloud’s planet-scale, completely serverless, and
     cost-effective enterprise data warehouse that works across clouds and
     scales with your data. With BigQuery, Google has separated compute from
     storage and connected the two via its petabit network, allowing the compute
     and storage functions to scale independently of each other. This allows
     users to leverage as many compute slots as necessary to answer a query; as
     a result, BigQuery offers measurable performance gains compared to other
     analytical systems.

     •	   BigQuery Omni: With BigQuery Omni, Google gives organizations the
          ability to leverage BigQuery even if their data is housed with other
          cloud service providers. When users deploy BigQuery Omni, they are able
          to query data that is stored in a tabular format outside Google Cloud,
          for example in Microsoft Azure or AWS, as if the data were being
          stored in a Google Cloud BigQuery environment. This capability allows
          users to receive all the benefits of Google BigQuery without requiring them
          to move the data across public clouds.
     •	   Data Governance: BigQuery allows for row-level and column-level security
          as well as other IAM-based permissions at the table and dataset levels.
          Combined with a DLP solution, BigQuery is one of the most extensible and
          secure solutions in the cloud today, and these data governance capabilities
          can also be applied to other clouds via BigQuery Omni.



                                        Translation

     Google Cloud offers out-of-the-box (OOTB) translation capabilities that
     support more than 100 languages. These translations do not require any
     pretraining and are available as APIs to be consumed. They are some of the
     highest-quality translations in the industry. Today Google offers both text
     and document translation capabilities. We believe that this will allow the
     World Bank to meet the needs of its global audience effectively.



                                       Document AI

Document AI is another differentiator for Google Cloud. It performs OCR and
extracts key-value pairs from documents with high fidelity, and works
particularly well with handwritten documents. The suite includes Document
Warehouse, which is a hosted repository of documents. Document AI and Document
Warehouse are going to be the earliest targets for introducing large language
models (LLMs), which will allow a unified cloud search experience, along
with natural language processing (NLP)-based offerings like summarization
and chatbots.



                                          Looker

Looker is Google’s cloud-based data exploration, discovery, and data analytics
platform. Key information is typically stored in a number of different data stores,
each with their own schemas and access processes. Looker provides discovery
and real-time analysis of data across multiple data stores, which is critical in
understanding disparate information from a business and technical perspective.

•	   Looker strikes a balance between governance and self-service in the
     deployment of analytics. This scalable, real-time approach prevents data
     sprawl and duplication headaches, including the common issue of having
     multiple versions of the same business intelligence (BI) reports and
     dashboards. Looker is capable of presenting dashboards and reports within
     the application, embedded in portals, and via third-party BI tools such
     as Tableau.
•	   Looker Blocks are free, reusable, and customizable OOTB templates that
     provide a head start in creating value from data. With Blocks, nontechnical
     users can quickly turn data into dashboards that can either be used as-is or
     be easily customized and blended with other data to meet specific needs.
     Blocks have been prebuilt to model and visualize a wide range of common
     use cases such as multicloud cost analysis, data warehouse log analysis,
     and much more. More than 150 Blocks are available for downloading from
     the Marketplace: https://marketplace.looker.com.




     Solution Overview and Key Results
     The Syntasa Data and AI platform was utilized for this POC to demonstrate
     rapid prototyping of several sentiment analytics use cases in Google Cloud
     Platform (GCP). The platform simplifies the use of GCP services for
     data scientists and analysts, allowing them to either code or visually build
     their apps. This helps users focus on constructing their data and AI pipelines
     using familiar user interfaces like Jupyter Notebook or Syntasa’s low-code
     and no-code workflow processes.

     The POC involved the creation of six Syntasa apps and Looker dashboards.
     These apps and dashboards explored a wide range of data and AI capabilities,
     including data ingestion, topic modeling, sentiment analysis, language
     translation, trend analysis, and AI explainability. The apps and dashboards
     covered the following use cases:

     •	   Trusted Domains
     •	   Uncertain Domains
     •	   Chinese Language
     •	   PDF Sourcing
     •	   Trend Analysis
     •	   Sentiment Explanation








14                                         Project SmartFi: Exploring AI/ML for FinTech News
Key Results

The key results obtained and demonstrated through dashboards, analysis, and
discussions are:

•	   World Bank Group (WBG) domain experts can gain deeper and quicker
     insights into their subject areas of interest by leveraging automated
     AI/ML technologies.
•	   WBG domain experts can focus their efforts by using customized narrative
     and topic modeling apps, and dashboards tailored to their needs by defining
     themes, keywords, data sources, languages, categories, and geographies
     of their choice.
•	   Sentiment analytics solutions that leverage large language models (LLMs)
     can classify positive and negative sentiment with greater than 85 percent
     accuracy when compared to manually classified relevant text.
•	   Google Translate APIs outperform open-source models by a wide
     margin, with 96 percent of translations done by Google Translate being
     deemed acceptable.
•	   Automated data and AI pipelines can extract full PDF reports from trusted
     sites and apply AI-based summarization and topic modeling to help WBG
     experts track the latest developments in their topics of interest.
•	   Trend analysis can be fine-tuned to the needs of the WBG team to detect
     and alert users to rising and falling topics, and to highlight emerging high-
     visibility events such as the FTX collapse and the Silicon Valley Bank failure.




For more details on app configuration and dashboard usage please refer to
Technical Approach (Syntasa).



     SmartFi - Trusted Domains

     The SmartFi - Trusted Domains app was developed to analyze and provide
     insights from the “SmartFi” content that is available on trusted domains. The
     app connects to the Brandwatch API; extracts relevant text; loads data into the
     GCP; filters and transforms the text for topic modeling and sentiment analysis;
     adds WBG-defined themes; and prepares an analysis-ready dataset for the
     trusted domain narrative dashboard. More than five years of historical data has
     been processed, and the production pipeline is updated daily.

     The SmartFi - Trusted Domains dashboard, built using Google’s Looker,
     provides comprehensive visibility into FinTech-related articles and
     conversations on trusted domains. Users can analyze key performance
     indicators (KPIs) and time series charts, and can drill down to original news or
     social media mentions. The dashboard includes filters for geographic regions,
     trusted domain categories, domain URLs, and sentiment, allowing for granular
     analysis of specific regions or categories. The screenshot in Figure 3.2
     shows a comparison of activity, sentiment, and trends, based on the categories
     defined by the project teams.

     FIGURE 3.2: Modeled Mentions




     For more technical details on data sources and topic modeling, see
     Technical Approach (Syntasa); for more details on the narrative dashboard,
     see Appendix A.




SmartFi - Uncertain Domains

The SmartFi - Uncertain Domains app focuses on all domains that are not
included in the trusted domain list. The workflow consists of multiple steps
similar to the ones in the Trusted Domains app, with the addition of a process
that uses the Twitter API and Twitter IDs to extract Tweet texts for topic
modeling. The data extraction is sampled at 2 percent, and over 3 months of
historical data has been processed. The production pipeline is updated daily.

The SmartFi - Uncertain Domains dashboard offers comprehensive visibility into
FinTech-related articles and conversations on websites and social media
platforms beyond the trusted domains. Users can analyze the impact of major
events, such as the FTX and Silicon Valley Bank collapses, and can explore
discussions with and without hashtags. The right panel of Figure 3.3 shows the
phrases that were present when authors mentioned “cryptocurrency exchange.”

FIGURE 3.3: Word Cloud




For more technical detail on data sources and topic modeling, see
Technical Approach (Syntasa); and for more detail on the narrative dashboard
see Appendix A.



     SmartFi - Chinese Language

     The SmartFi - Chinese Language app was created to demonstrate the
     translation capabilities of Google Cloud, and compare them with open-source
     translation routines. The workflow consists of multiple steps, similar to those in
     the Trusted Domains app, with the addition of multiple translation processes.
     Only one day of Chinese language mentions (a little less than 1M mentions)
     was processed, and the production pipeline was not activated.

     The SmartFi - Chinese Translation dashboard offers the same analytics abilities
     as the Uncertain Domains dashboard, but with a focus on Chinese language
     content. Users can explore and compare narratives expressed by authors
     in Chinese, with both the original and translated text displayed side by side.
     Figure 3.4 shows the domains, authors, and sample translated and original text.

     FIGURE 3.4: Domain Source




     For more detail on comparison of translation algorithms, see
     Chinese Translation.




SmartFi - PDF Sourcing

The SmartFi - PDF Sourcing app was created to demonstrate rapid prototyping
capability, exploring both website crawling and search API approaches for
automating the extraction of PDF reports from trusted sites. The search API
approach was found to be more targeted and efficient.

The SmartFi - PDF dashboard provides a faster way to acquire information from
trusted data sources, displaying links to PDFs, AI-generated summaries, and
topic modeling analysis of the PDFs. Figure 3.5 shows that over 5,000 PDF
documents were automatically downloaded and analyzed from the European
Central Bank site.

FIGURE 3.5: Domain and PDF Sourcing




For more detail on PDF sourcing implementation, see PDF Sourcing.



     SmartFi - Trend Analysis

     The SmartFi - Trend Analysis app analyzes the output of the SmartFi Trusted
     Domain app to identify rising and falling topics and phrases. Users can
     customize trend analysis, for example by using a seven-day rolling average to
     smooth out daily fluctuations. The app has analyzed the trusted domain app
     output from October 2022 and is updated daily.
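
The rolling-average smoothing described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Syntasa implementation: the function names, the sample counts, and the definition of a "rising" phrase (the latest change in its smoothed daily volume) are all hypothetical.

```python
from collections import deque

def rolling_average(counts, window=7):
    """Trailing mean over up to `window` of the most recent daily counts."""
    buf, total, out = deque(), 0.0, []
    for c in counts:
        buf.append(c)
        total += c
        if len(buf) > window:
            total -= buf.popleft()
        out.append(total / len(buf))
    return out

def top_rising(daily_counts, window=7, k=5):
    """Rank phrases by the most recent change in their smoothed volume."""
    changes = {}
    for phrase, counts in daily_counts.items():
        smooth = rolling_average(counts, window)
        if len(smooth) >= 2:
            changes[phrase] = smooth[-1] - smooth[-2]
    return sorted(changes, key=changes.get, reverse=True)[:k]

# Hypothetical daily mention counts (illustrative numbers only).
sample = {
    "alameda research": [2, 3, 2, 4, 3, 5, 40],
    "smart contracts": [10, 11, 9, 10, 12, 10, 11],
    "stablecoin": [8, 7, 9, 8, 6, 5, 4],
}
print(top_rising(sample, k=2))
```

A longer window smooths more aggressively but reacts more slowly to a sudden spike, so the window length is a tuning choice rather than a fixed rule.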

     The SmartFi - Trending Dashboard displays the results of the trend analysis,
     allowing users to detect and be alerted to rising and falling topics, and to
     highlight emerging high-visibility events, such as the FTX collapse and the
     Silicon Valley Bank failure. The left panel in Figure 3.6 shows the top five
     topics/phrases by volume, and the right panel shows the top five rising
     topics/phrases on Nov 10, 2022. As can be seen, a day before the FTX collapse
     on Nov 11, 2022, Alameda Research was the top rising phrase in the trusted
     sources data.

     FIGURE 3.6: Trending Topics




     SmartFi - Sentiment Models and Explainability

     The SmartFi - Sentiment Explanation app was created to address two research
     questions: 1) comparison of different sentiment analysis models; and 2) analysis
     of gender and race bias. The FinBERT model was found to be over 85 percent
     accurate for positive and negative sentiment classification when compared to
     manually classified relevant text.

     The SmartFi - Sentiment validation dashboard provides an in-depth view of
     the sentiment analysis, allowing users to explore the performance of different
     sentiment models, such as the FinBERT model and Google’s AutoML. The
     middle panel in Figure 3.7 shows that the FinBERT model was over
     85 percent accurate for financial text.



FIGURE 3.7: Sentiment Validation




For more detail on sentiment model evaluation, see Sentiment Analysis.



The dashboard also enables users to analyze potential gender and race biases
in sentiment classification, providing insights into ensuring unbiased analysis of
financial narratives.

FIGURE 3.8: Sentiment Model Explainability




For more detail on the model explainability, see Trustworthy and Explainable AI.



     Technical Approach
     (Syntasa)
     Data Sources and Preparation


                                    Reference Data

     Working in collaboration with Syntasa, the World Bank Group (WBG) provided
     a number of parameters to help scope this project, facilitate data collection,
     and ensure alignment with WBG business objectives. These data were defined
     by the WBG in a spreadsheet that included themes and keywords related to
     financial technology in both English and Chinese; a prioritized list of online
     news and media websites referred to as Trusted Domains; and geographic
     regions of interest. (For more detail see Appendix B: Reference Data.)



                               SmartFi - Trusted Domains

     The goal of the SmartFi - Trusted Domains solution is to extract meaningful
     insights from the WBG Trusted Domains. The SmartFi - Trusted Domains app
     contains the pipeline that was created in Syntasa to ingest and process the
     underlying data needed to accomplish this. The app includes a combination of
     ready-made and custom processes to ingest the Brandwatch Trusted Domains
     dataset, and the WBG themes, categories, and regions into BigQuery; then
     process each mention to mitigate noise; apply the predefined WBG themes,
     categories, and regions; and extract topics, phrases, and companion phrases.
     Finally, the data is combined into a single curated dataset used for analysis
     in Looker. Figure 3.9 shows the data and AI pipeline for the SmartFi – Trusted
     Domain app configured in the Syntasa Platform. (For more details see
     Appendix D: SmartFi – Trusted Domains Technical Details.)




FIGURE 3.9: Solution Architecture




                             SmartFi - Uncertain Domains

The SmartFi - Uncertain Domains app contains the pipeline that was
created in Syntasa to ingest and process the underlying data needed to
extract meaningful insights from sources, explicitly excluding the WBG
Trusted Domains. A lighter version of the SmartFi - Trusted Domains app,
this app includes a combination of ready-made and custom processes to
ingest the Brandwatch Uncertain Domains dataset and the WBG themes
into BigQuery; apply predefined WBG themes; and then extract topics,
phrases, and companion phrases. Licensing restrictions prevent Brandwatch
from providing any Twitter tweet text via the Brandwatch API, so the app
also retrieves the full tweet text directly from the Twitter API. Finally, the
data is combined into a single curated dataset used for analysis in Looker.
Figure 3.10 shows the data and AI pipeline for the SmartFi – Uncertain
Domain app configured in the Syntasa Platform. (For more detail see
Appendix E: SmartFi – Uncertain Domains Technical Details.)




     FIGURE 3.10: Data and AI pipeline




                              SmartFi - Chinese Language

     The SmartFi - Chinese Language app contains the pipeline that was
     created in Syntasa to ingest and process the underlying data needed to
     extract meaningful insights from the Chinese mentions. The app functions
     identically to the SmartFi - Uncertain Domains app, with the addition of a
     ready-made translation process to translate Chinese snippet text into English.
     Figure 3.11 shows the data and the AI pipeline for the SmartFi – Chinese
     Language app configured in the Syntasa platform. (For more detail see
     Appendix F: SmartFi – Chinese Language Technical Details.)




FIGURE 3.11: Chinese Language App Configuration




Topic Modeling

Syntasa has conducted topic modeling on social media and news texts to
bring to light the most dominant and frequent conversations contained within
them. The strategy is to start with a general subject area (for example, text
that contains keywords related to finance) and further break it down into
expert-defined themes (top-down) and AI-identified topics (bottom-up) for
quick discovery of the narratives that are being discussed. Syntasa’s focus is
on automating the clustering workflow so as to reduce manual oversight, work
dynamically on either small or big data, automatically discard irrelevant text,
and preserve the most dominant clusters, which are also self-named.

An unsupervised clustering approach is most useful because the topics (or
classes) are not known beforehand. Likewise, developing a classifier from the
resulting clusters would not be a suitable solution because it would not be
able to discover new conversations as they appear in real time.

Some of the popular approaches to clustering involve algorithms such as
K-means or LDA, which can be used to group similar sentences/text together,
but these have some downsides, especially with very diverse text. These algorithms
     require knowing beforehand how many clusters it is optimal to create;
     otherwise the clusters start blending words that have no similarity to each
     other. Determining the optimal number of clusters (K) requires sampling
     many different Ks, and having the manual oversight needed to search for
     that number. There is also no guarantee that the sampling will include the
     optimal K of the text; rather, the analysis would select only the best K of the
     sampling. Therefore, searching for optimal K with manual oversight increases
     computational and labor costs. In the case of social media and news text,
     conversations can be diverse to the point where it becomes impractical to
     find the optimal K needed in order to try to force all of the text into respective
     clusters. Examining the contents of these clusters is usually done by pulling
     n-grams (for example, bigrams or trigrams) and having an analyst manually
     determine the “topic” that is being discussed. Because sentences are long
     compared to n-grams, there will be a mixture of unrelated n-grams in a group
     that is supposed to summarize the cluster content.

     To overcome the problem of manually naming clusters based on n-grams
     that likely do not have similarity to each other, a novel approach had to
     be developed. Rather than clustering at the sentence level and then
     examining n-grams, Syntasa developed a strategy that begins with a focus
     on the n-grams themselves. First, the highest-occurring n-grams are used
     as “topics” and serve as the cluster centers, which are then checked against
     the other, nontopic n-grams for similarity. This allows the topic to be
     self-named, and only similar phrases to be grouped together; this provides an
     obvious answer to the content of the cluster. Similarity checks between the
     n-grams are made by using a BERT embedder with cosine similarity. All of the
     text that does not get linked to a topic then gets discarded. The discarded
     text includes text that is too short to form a valid n-gram; text that
     contains a valid n-gram that does not occur often enough; and text that
     contains a valid n-gram that is unrelated to the top topics. Discarding text
     is a desired side effect, because the discarded text then does not
     contaminate the remaining clusters.
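
The n-gram-first strategy described above can be sketched as follows. This is a hedged illustration, not Syntasa's algorithm: the function names, thresholds, and sample texts are hypothetical, and the `embed` function is a toy character-trigram stand-in for the BERT embedder so that the example runs without downloading a model.

```python
import math
import re
from collections import Counter

def ngrams(text, n=2):
    """Word n-grams of a lowercased text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def embed(phrase):
    """Toy stand-in for a BERT embedder: character-trigram counts."""
    padded = f" {phrase} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def topic_model(texts, n_topics=2, min_count=2, threshold=0.5):
    """Use the most frequent n-grams as self-named cluster centers."""
    counts = Counter(g for t in texts for g in dict.fromkeys(ngrams(t)))
    frequent = [g for g, c in counts.most_common() if c >= min_count]
    topics = frequent[:n_topics]
    centers = {t: embed(t) for t in topics}
    # Link each frequent n-gram to its most similar topic, if similar enough.
    phrase_topic = {}
    for gram in frequent:
        vec = embed(gram)
        best = max(topics, key=lambda t: cosine(vec, centers[t]))
        if cosine(vec, centers[best]) >= threshold:
            phrase_topic[gram] = best
    clusters, discarded = {t: [] for t in topics}, []
    for text in texts:
        linked = {phrase_topic[g] for g in ngrams(text) if g in phrase_topic}
        if not linked:
            discarded.append(text)  # too short, too rare, or unrelated
        for topic in sorted(linked):
            clusters[topic].append(text)
    return clusters, discarded

# Hypothetical sample texts.
texts = [
    "FTX crypto exchange halts withdrawals",
    "Another crypto exchange faces scrutiny",
    "Regulators probe the crypto exchange",
    "Central bank raises interest rates",
    "Markets react as central bank meets",
    "Central bank holds steady",
    "Two crypto exchanges merge",
    "Smaller crypto exchanges struggle",
    "ok",
    "Weather was sunny today",
]
clusters, discarded = topic_model(texts)
for topic, members in clusters.items():
    print(topic, len(members))
print("discarded:", discarded)
```

In this sketch the cluster name is simply the center n-gram, and a text can be linked to more than one topic. Swapping `embed` for a sentence-embedding model would change the similarity scores but not the overall flow.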




FIGURE 3.12: Topic Modeling Parameters




While Syntasa’s topic modeling rapidly produces self-named topics, the term
“topic” is itself subjective. To some analysts, a topic could be as high-level
as “crypto” or “banking,” but to others it might be more granular; for
example, “cryptocurrency exchange,” “blockchain technology,” or “smart
contracts.” The more granular the topics, the more of them there will be; this
can overwhelm the analysis, but it can also be more informative about the
narrative. To preserve both the high-level and granular topics, Syntasa
prepares the data carefully for dashboards in order to give users full
filtering capabilities with which they can narrow the conversation further. A
user can start with the larger topics that are generated and then delve into
the various narratives surrounding them.



     FIGURE 3.13: Dashboard Trending Phrases




     Sentiment Analysis

     Syntasa conducted an experiment that explored the use of additional
     NLP models for producing sentiment classifications and measured the
     agreement levels between the models and members of the World Bank
     Group. This was an effort to improve on the built-in sentiment model
     from Brandwatch, which proved to be a black box that made unreliable
     classifications.

     Two models from Hugging Face’s model repository were selected. The first
     was a model trained on 124 million tweets, which learned colloquial
     conversation; the second, named FinBERT, was trained to understand financial
     terminology. Both models proved to be good in their respective fields. The Twitter model
     could accurately identify positive or negative text, for example, in the context
     of reviewing products (in this case, crypto exchanges), whereas the FinBERT
     model did a better job of accurately classifying financial terms (for example
     “surged 27 percent,” or “$40bn implosion”). If a mix of colloquial talk and formal
     financial talk is to be collected in the future, an ensemble or combination of
     these models could be used to capture more of the text accurately.




FIGURE 3.14: Sentiment Explainability




In order to evaluate the accuracy of each model (Twitter, FinBERT, and
Brandwatch) we asked a WBG domain expert to manually score over 200
texts into negative, neutral, and positive sentiment classes. This enabled us to
determine the accuracy of each model using the WBG scores as ground truth.
Our model evaluation showed that:

•	   The Brandwatch sentiment model, at only 24 percent accuracy, was
     unacceptable, as we had seen in working with other clients.
•	   The Twitter RoBERTa sentiment model was also unacceptable. It was only
     47 percent accurate; after some tuning (setting the confidence score to
     70 percent and above) we were able to increase the accuracy, but only to
     51 percent.
•	   The FinBERT model, on the other hand, started with 64 percent accuracy
     and after adjusting the confidence score to 70 percent, the accuracy was
     increased to 75 percent.
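
The effect of a confidence threshold on measured accuracy can be illustrated with a short sketch. The record layout, the function name, and the sample values are assumptions made for illustration; they are not the project's evaluation data. Restricting by class is applied here to the expert label, which is one possible reading of limiting the evaluation to positive and negative classes.

```python
def accuracy_at_threshold(records, threshold=0.0, classes=None):
    """Accuracy over records whose model confidence meets the threshold.

    Each record is (expert_label, model_label, confidence).
    Returns (accuracy, records_kept); accuracy is None if nothing qualifies.
    """
    kept = [
        (expert, model)
        for expert, model, conf in records
        if conf >= threshold and (classes is None or expert in classes)
    ]
    if not kept:
        return None, 0
    correct = sum(1 for expert, model in kept if expert == model)
    return correct / len(kept), len(kept)

# Hypothetical scored texts: (expert label, model label, model confidence).
records = [
    ("positive", "positive", 0.95),
    ("negative", "negative", 0.88),
    ("neutral", "positive", 0.55),
    ("negative", "neutral", 0.62),
    ("positive", "positive", 0.71),
    ("neutral", "neutral", 0.80),
]
print(accuracy_at_threshold(records))
print(accuracy_at_threshold(records, threshold=0.70))
print(accuracy_at_threshold(records, threshold=0.70, classes={"positive", "negative"}))
```

Raising the threshold trades coverage for accuracy: fewer records qualify, so the second value returned matters as much as the first.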




     The FinBERT model was the only acceptable model we found. We also
     experimented with ensemble models, in which the answer of either the Twitter
     or the FinBERT model could supersede that of the other based on a higher
     confidence score. This did not increase accuracy by a large factor, but it did
     slightly increase the number of records qualifying above the 70 percent
     confidence score. The Brandwatch model proved to be the least accurate, and
     it also did not expose confidence scores or support bias tests.
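
One simple way to let one model's answer supersede the other's is to keep whichever prediction has the higher confidence. The sketch below assumes each model returns a (label, confidence) pair; the function name and the sample predictions are hypothetical, and the project's two ensemble methods may have differed.

```python
def ensemble(pred_a, pred_b):
    """Keep whichever (label, confidence) prediction is more confident."""
    return pred_a if pred_a[1] >= pred_b[1] else pred_b

# Hypothetical predictions for the same three texts.
twitter_preds = [("neutral", 0.58), ("positive", 0.83), ("negative", 0.66)]
finbert_preds = [("negative", 0.91), ("positive", 0.64), ("negative", 0.69)]

combined = [ensemble(t, f) for t, f in zip(twitter_preds, finbert_preds)]
qualifying = [p for p in combined if p[1] >= 0.70]
print(combined)
print(len(qualifying), "of", len(combined), "records at or above 0.70")
```

Because the combined confidence is the maximum of the two, the number of records clearing a fixed threshold can only stay the same or grow compared with either model alone.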

     The WBG expert also classified the text in the experiment to indicate whether
     it was relevant or not. More than 75 percent of the text was deemed relevant.
     When the evaluation is restricted to text that the WBG expert marked as
     relevant, to positive and negative sentiments only, and to confidence scores
     greater than 60 percent, FinBERT produces an accuracy of 86 percent. While a
     “relevant” classification will not be available on new live data, a text
     classifier model could be developed to further narrow down the text. The
     screenshot of the dashboard in Figure 3.15 shows that 86 percent agreement
     was captured.

     FIGURE 3.15: Sentiment Validation




This dashboard was created in order to observe the results of the sentiment
model validation test. The three pie charts show, in order: the distribution of
the sentiments that WBG supplied; the sentiment distribution of the model
selected; and the percentage of agreement between the WBG expert and
the model.

Filters allow for the selection of:

•	   The 3 models + the 2 ensemble methods
•	   The sentiment outputs of positive, negative, neutral
•	   Confidence scores
•	   Agreement between the World Bank Group and the model
•	   The World Bank Group’s relevance indicator

These filters are useful for narrowing down the acceptability of the models, such
as on specific classifications and/or at specific confidence levels.

In conclusion, we found that FinBERT, an open-source sentiment model, can
be an effective way of producing accurate sentiment classifications that are
closely in line with WBG’s expert opinions. We also demonstrated that accuracy
can be boosted by adjusting the confidence thresholds, and by limiting the
scope to just positive and negative sentiment classes.


Chinese Translation

The Helsinki-NLP open-source model
https://huggingface.co/Helsinki-NLP/opus-mt-zh-en was used for
translation from Chinese into English; similar models are available for all of
the most common languages.

Syntasa conducted a comparison of two translation options, Hugging Face/
Generic Models and Google’s Cloud Translation service. These two options
differ in several key aspects, including cost, speed, customization, and
language support.

Hugging Face models are a lower-cost option for translation needs. They are
easy to customize by simply swapping the model, and they are relatively
inexpensive to run. However, they are slower than Cloud Translation. They are
also limited in their language support: each model is designed for specific
languages, so handling multiple languages requires additional work to add
language detection capabilities.


     Google’s Cloud Translation service is a fast and flexible option. It can easily be
     configured to stream directly, and it is capable of translating any language it
     can detect. This language-agnostic nature makes it a more flexible option for
     multilingual projects. However, Cloud Translation is also more expensive than
     the Hugging Face models ($20 per 1 million characters). This can make it a less
     suitable option for projects that require higher-volume translations and are
     operating within tight budgets.
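
A back-of-the-envelope estimate shows how the per-character price scales with volume. The rate is the figure quoted above; the average mention length is an assumption chosen only for illustration, and the function name is hypothetical.

```python
def translation_cost_usd(total_characters, usd_per_million=20.0):
    """Cost of translating a given number of characters at a flat rate."""
    return total_characters / 1_000_000 * usd_per_million

mentions = 1_000_000          # roughly one day of Chinese mentions in the POC
avg_chars_per_mention = 150   # assumption for illustration only
daily_cost = translation_cost_usd(mentions * avg_chars_per_mention)
print(f"Estimated cost for one day: ${daily_cost:,.2f}")
```

Under these assumptions a daily pipeline at full volume would cost far more than a sampled one, which is why sampling or pre-filtering by relevance before translation may be worth considering.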

     As part of the comparison, the WBG team manually evaluated the translations
     produced by each option. The Google translations were found to be superior
     in quality, mostly because the Hugging Face model didn’t completely translate
     the entire text. This analysis indicates that both options can be used, but the
     Google translation is preferable when a higher-quality translation is required.
     Figure 3.16 shows the comparison of the cloud vs the model translations, where
     it can clearly be seen that the cloud translation results are superior.

     FIGURE 3.16: Language Translation Performance

     Translation Comparison (pie chart): Cloud Win 71%; Even 25%; Hugging Face Win 4%




     Ultimately, the choice between Hugging Face models and Google’s Cloud
     Translation service will depend on the specific needs of the project. If cost
     is a primary concern, Hugging Face models offer a cost-effective solution.
     If flexibility and speed are important factors, Cloud Translation is the better
     option, despite its higher cost. The language requirements of the project should
     also be considered, since Hugging Face models may require additional work to
     support multiple languages.


PDF Sourcing

To gather PDFs related to specific topics, Syntasa explored two primary
methods: search APIs and website crawling.

The search API approach was the most effective. A search API was previously
offered by Google but is no longer available, so Syntasa used Bing Web Search
as an alternative to gather PDFs related to specific topics. This allowed for a
more streamlined and efficient process for sourcing relevant PDFs.
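
The search-based approach can be sketched as composing a query restricted to one domain and to PDF files, then filtering the returned links. This sketch makes no network call and does not reproduce the project's code: the helper names and sample URLs are hypothetical, and it assumes the search service honors the `site:` and `filetype:` query operators.

```python
from urllib.parse import urlparse

def build_pdf_query(domain, keywords):
    """Compose a search query restricted to PDFs on one domain."""
    terms = " OR ".join(f'"{k}"' for k in keywords)
    return f"site:{domain} filetype:pdf ({terms})"

def keep_pdf_links(urls, domain):
    """Keep only result URLs that point to a PDF on the given domain."""
    kept = []
    for url in urls:
        parsed = urlparse(url)
        host = parsed.netloc.lower()
        on_domain = host == domain or host.endswith("." + domain)
        if on_domain and parsed.path.lower().endswith(".pdf"):
            kept.append(url)
    return kept

query = build_pdf_query("ecb.europa.eu", ["fintech", "digital euro"])
print(query)

# Hypothetical search results.
results = [
    "https://www.ecb.europa.eu/pub/pdf/example-report.pdf",
    "https://www.ecb.europa.eu/press/index.html",
    "https://example.com/unrelated.pdf",
]
print(keep_pdf_links(results, "ecb.europa.eu"))
```

Filtering the results again on the client side guards against results that ignore the operators, so only PDFs from the trusted domain reach the download step.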

While website crawling can, in principle, gather every PDF, it proved to
be slow, cumbersome, and expensive. Therefore, it is not recommended as a
primary method for PDF sourcing. (See Figure 3.17.)

FIGURE 3.17: PDF Sourcing




Syntasa used the Bart-Large-CNN summarization algorithm to effectively
summarize the content of the sourced PDFs. This app can be easily modified
to incorporate any other summarization algorithms used by the WBG, or any
publicly available summarization models. Other models were tested, including
Google’s Pegasus model, but Syntasa did not perform in-depth evaluation
and comparison of the two models. The Bart-Large-CNN algorithm performed
sufficiently well for this use case since the focus was on PDF extraction.


     In conclusion, Syntasa used Bing Web Search as an alternative to the
     previously offered Google Search API to gather PDFs related to specific
     topics. The Bart-Large-CNN summarization algorithm was used to effectively
     summarize the content of the PDFs, proving that it’s possible to extract PDF
     documents from specified domains and summarize them in order to increase the
     efficiency of the current manual process. For future evaluation and exploration,
     we recommend a more systematic review of various summarization algorithms.



     Trustworthy and Explainable AI

     For the two sentiment models put in place, Syntasa conducted bias tests
     using a Python library called Transformers-Interpret. This library explains
     Hugging Face models built on PyTorch by displaying the weights of the
     features. It uses Facebook’s Captum to apply integrated gradients to the
     features in order to obtain the weight of each word in the text.

     By using this library, Syntasa was able to determine whether the models
     reacted significantly differently to gender or racial terminology. Using
     a word replacement strategy, the same sentence was scored while switching out
     gender- or race-related words (for example, “he” or “she”). The sentiment and
     confidence scores were then examined to determine whether these kinds of
     words could sway the model. In all of Syntasa’s experiments, the sentiment
     never changed, regardless of race or gender terms, and the confidence scores
     showed only slight variability.
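
The word replacement test can be sketched as follows. This is an assumption-laden illustration: the swap table, the function names, and the sample sentence are hypothetical, and `toy_classifier` is a placeholder lexicon rule standing in for a real sentiment model that returns a label and a confidence score.

```python
import re

# Hypothetical one-directional swap table for illustration.
SWAPS = {"he": "she", "him": "her", "his": "her", "man": "woman"}

def swap_terms(text, swaps=SWAPS):
    """Replace each whole-word key with its counterpart, keeping capitalization."""
    def repl(match):
        word = match.group(0)
        new = swaps[word.lower()]
        return new.capitalize() if word[0].isupper() else new
    pattern = r"\b(" + "|".join(map(re.escape, swaps)) + r")\b"
    return re.sub(pattern, repl, text, flags=re.IGNORECASE)

def bias_check(text, classify, swaps=SWAPS):
    """Compare a classifier's output on a sentence and its swapped variant."""
    original = classify(text)
    swapped = classify(swap_terms(text, swaps))
    return {
        "same_label": original[0] == swapped[0],
        "confidence_gap": abs(original[1] - swapped[1]),
    }

def toy_classifier(text):
    """Placeholder for a real sentiment model; returns (label, confidence)."""
    negative = {"collapse", "loss", "fraud"}
    words = set(re.findall(r"[a-z]+", text.lower()))
    return ("negative", 0.9) if words & negative else ("neutral", 0.6)

sentence = "He said the exchange collapse wiped out his savings."
print(swap_terms(sentence))
print(bias_check(sentence, toy_classifier))
```

With a real model, a changed label or a large confidence gap on the swapped sentence would flag a term worth investigating further.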

     FIGURE 3.18: Sentiment Explainability




The dashboard screenshot in Figure 3.18 shows the outputs of the model
trained on Twitter data and how it reacted to identical sentences where gender-
based terminology was swapped out. In all three examples, we can see that the
sentiments were the same whether a male or a female term was used; there
were also similar confidence levels.

Having conducted this experiment, Syntasa now has the structure in place for
future bias tests, which can accommodate new types of testing as desired by
the WBG.



Solution Integration

There are many options available for integrating the Syntasa Data and AI
platform and the sentiment analytics solution with the WBG IT environment.
Figure 3.19 shows the technical deployment architecture for the Syntasa
Platform in GCP, including the GCP services and network configuration.

FIGURE 3.19: Solution Architecture




     The current POC was conducted as a private cloud SaaS where a single-
     tenant solution was hosted in a dedicated GCP project within a fully controlled
     Virtual Private Cloud (VPC). A similar solution architecture has been deployed
     for clients with highly sensitive data, and this architecture, when deployed in
     the WBG’s GCP organization, can achieve the highest level of compliance,
     including FedRAMP High.

     The solution can support Single Sign On to simplify the connectivity from the
     WBG network, using the existing corporate authentication services.

     For the POC, since only publicly available data was used, it was determined
     that the simplest compliant option was to host the solution in a
     Syntasa-controlled GCP project, and use the WBG GCP billing account. The POC
     had initial success in demonstrating the potential of using large language
     models to automate text and sentiment analysis for several use cases. With
     the additional exploration, development, and testing required to reach
     a production-ready state, we can envision proceeding with either a similar
     arrangement of a Syntasa-controlled GCP project, or a WBG-controlled GCP
     organization, folder, and project.




SECTION 4: LEARNING OUTCOMES AND FUTURE CONSIDERATIONS




  Technical Learnings for
  World Bank
  Topic Modeling

  Topic modeling is a statistical and computational technique used to identify
  underlying topics or themes within a collection of texts or documents. It is a
  process of extracting meaningful patterns or themes from large volumes of text
  data. The goal of topic modeling is to identify the most significant topics present
  in the documents without prior knowledge of the topics.

  The most commonly used types of topic modeling are Latent Dirichlet
  Allocation (LDA) and Non-Negative Matrix Factorization (NMF).

  LDA is a probabilistic model that assumes that each document contains a
  mixture of topics, and each topic is a probability distribution over words. The
  model infers the topics based on the distribution of words in the documents.
  The output of LDA is a set of topics, along with the distribution of each topic
  across the documents, and the distribution of each word across the topics.

  NMF is a matrix factorization technique that decomposes the document-term
  matrix into two non-negative matrices, one representing the weight of each
  topic in each document, and the other representing the weight of each word in
  each topic. The output of NMF is a set of topics, along with the weight of each
  word in the topics.
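As a rough illustration of the factorization described above, the following minimal sketch applies the standard multiplicative-update rule for NMF to a tiny document-term matrix. It is not the implementation used in the project: the vocabulary, the word counts, the number of topics, and all function names are assumptions made up for this example.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius_error(v, w, h):
    """Squared reconstruction error between V and W x H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factorize V (documents x terms) into W (documents x topics) and H (topics x terms)."""
    rng = random.Random(seed)
    n_docs, n_terms = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(n_topics)] for _ in range(n_docs)]
    h = [[rng.random() + 0.1 for _ in range(n_terms)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_terms)]
             for k in range(n_topics)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_topics)]
             for i in range(n_docs)]
    return w, h

# Hypothetical vocabulary and word counts for four short documents.
vocabulary = ["bitcoin", "wallet", "cbdc", "bank"]
doc_term = [
    [3, 2, 0, 0],
    [2, 3, 0, 1],
    [0, 0, 3, 2],
    [0, 1, 2, 3],
]

doc_topic, topic_term = nmf(doc_term, n_topics=2)
for k, weights in enumerate(topic_term):
    ranked = sorted(zip(vocabulary, weights), key=lambda p: p[1], reverse=True)
    print(f"topic {k}:", [word for word, _ in ranked[:2]])
```

Each row of `topic_term` holds the word weights for one topic, and each row of `doc_topic` holds the topic weights for one document, which is the pair of outputs the paragraph above refers to.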

  Previously, TI Lab worked on several projects that required topic modeling. LDA
  was used primarily to tackle the grouping of documents into clusters. During
  our collaboration with Syntasa, they introduced us to their custom algorithm,
  which has proven to be more accurate and robust than LDA.


     Syntasa’s team introduced us to a mix of three different algorithms used to
     achieve the project’s objective. This objective is based on the need for the
     text snippets to be assigned to multiple topics and for the topics to be named
     automatically. Since FinTech-related social media posts are extremely diverse,
     using the LDA algorithm is no longer a viable option.

     FIGURE 4.1: Topic Modeling

     [Figure: K-means clustering (highest occurring phrases as cluster centers),
     Fast clustering (similarity checks using cosine similarity with volume
     criteria), and Graph Networks (to link snippets to multiple topics) combine
     into Syntasa clustering.]





     The K-means topic modeling technique is a clustering method that groups the
     documents into a fixed number of topics based on the similarity of their word
     frequencies. The key disadvantage of this modeling technique is that it requires
     manual naming of topics, and it assigns only one topic per snippet of text. It is
     also sensitive to the initial conditions, and the results may vary depending on
     the random initialization of the algorithm. However, if used in combination with
     other algorithms, it can provide valuable insights.
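The two limitations mentioned above (one topic per snippet, and sensitivity to the starting centers) can be seen in a minimal sketch of plain K-means over word counts. The snippets, the function names, and the `init_indices` parameter are assumptions for illustration; this is not Syntasa's variant, which uses the highest occurring phrases as cluster centers.

```python
from collections import Counter

def vectorize(snippets):
    """Turn each snippet into a word-count vector over a shared vocabulary."""
    counts = [Counter(s.lower().split()) for s in snippets]
    vocab = sorted(set(word for c in counts for word in c))
    return [[c[word] for word in vocab] for c in counts]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, init_indices, n_iter=10):
    """Plain k-means; init_indices picks the snippets used as starting centers."""
    centers = [list(vectors[i]) for i in init_indices]
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assignment step: each snippet goes to exactly one (nearest) center.
        labels = [min(range(len(centers)),
                      key=lambda k: squared_distance(v, centers[k]))
                  for v in vectors]
        # Update step: move each center to the mean of its members.
        for k in range(len(centers)):
            members = [v for v, label in zip(vectors, labels) if label == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical snippets.
snippets = [
    "bitcoin price rally bitcoin",
    "bitcoin price drop",
    "central bank digital currency",
    "central bank currency pilot",
]
labels = kmeans(vectorize(snippets), init_indices=(0, 2))
print(labels)  # [0, 0, 1, 1]
```

Every snippet receives a single integer label, which still has to be named by hand, and a different choice of `init_indices` may produce a different grouping.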

     Fast Clustering is another type of topic modeling that works somewhat like
     hierarchical clustering, but is tuned for speed. It is useful when the number of
     clusters is unknown and the dataset is quite large. With fast clustering, the
     developer can freely configure the threshold of what is considered to be similar.
     A high threshold will only find extremely similar sentences; a lower threshold
     will find more sentences that are less similar to each other.1

     1	https://www.sbert.net/examples/applications/clustering/README.html
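The effect of the similarity threshold can be sketched as follows. This is a simplified, assumption-laden version of threshold-based clustering: it uses plain word-count vectors instead of sentence embeddings, and the function names, the `min_size` parameter (standing in for the volume criteria), and the snippets are all hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(x * x for x in a.values()))
            * math.sqrt(sum(x * x for x in b.values())))
    return dot / norm if norm else 0.0

def fast_cluster(snippets, threshold=0.5, min_size=2):
    """Greedy threshold clustering: no preset number of clusters is needed."""
    vectors = [Counter(s.lower().split()) for s in snippets]
    # For every snippet, collect all snippets at least `threshold` similar to it.
    candidates = []
    for v in vectors:
        members = [j for j, u in enumerate(vectors) if cosine(v, u) >= threshold]
        if len(members) >= min_size:  # volume criterion
            candidates.append(members)
    # Keep the largest candidate groups first and skip snippets already assigned.
    candidates.sort(key=len, reverse=True)
    assigned, clusters = set(), []
    for members in candidates:
        free = [j for j in members if j not in assigned]
        if len(free) >= min_size:
            clusters.append(sorted(free))
            assigned.update(free)
    return clusters

# Hypothetical snippets.
snippets = [
    "bitcoin price hits new high",
    "bitcoin price hits record",
    "central bank launches digital currency pilot",
    "central bank digital currency pilot expands",
    "weather is sunny today",
]
print(fast_cluster(snippets, threshold=0.5))  # [[0, 1], [2, 3]]
print(fast_cluster(snippets, threshold=0.9))  # []
```

With the lower threshold, two groups are found and the unrelated snippet is left unassigned; with the higher threshold, no pair is similar enough to form a group.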



Graph Networks can also be used to link multiple topics to a text. Graph
Networks represent the documents and topics as nodes in a graph, and the
relationships between them as edges. By analyzing the graph, it is possible to
identify the most significant topics and their relationships to the documents.
Graph Networks can also be used to visualize the topics and their relationships,
making it easier to interpret the results.

The solution is fully scalable, using Apache Spark to process large data and
take advantage of the Cloud infrastructure.

The outcome for a real-life example is described using the image shown in
Figure 4.2. The phrase that was analyzed is “The blockchain network allows
users to avoid Central Banks.” This sentence clearly has more than one topic,
and the figure shows how it can be connected to three different topics: for
example, Allows Users; Blockchain Technology; and Central Banks.
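The linking of one snippet to several topics can be sketched as a small bipartite graph stored in two dictionaries. This is only an illustration: the topic-to-phrase mapping, the second snippet, and the function name are assumptions, and simple substring matching stands in for whatever linking logic the actual solution uses.

```python
from collections import defaultdict

# Hypothetical topics and the phrases assumed to signal each one.
topic_phrases = {
    "Blockchain Technology": ["blockchain"],
    "Central Banks": ["central bank"],
    "Allows Users": ["allows users"],
    "Digital Wallets": ["digital wallet"],
}

def build_graph(snippets, topic_phrases):
    """Return the edges of a bipartite graph in both directions."""
    snippet_to_topics = defaultdict(set)
    topic_to_snippets = defaultdict(set)
    for snippet in snippets:
        text = snippet.lower()
        for topic, phrases in topic_phrases.items():
            if any(phrase in text for phrase in phrases):
                # One edge between the snippet node and the topic node.
                snippet_to_topics[snippet].add(topic)
                topic_to_snippets[topic].add(snippet)
    return snippet_to_topics, topic_to_snippets

snippets = [
    "The blockchain network allows users to avoid Central Banks.",
    "A digital wallet built on blockchain.",
]
snippet_to_topics, topic_to_snippets = build_graph(snippets, topic_phrases)
print(sorted(snippet_to_topics[snippets[0]]))
# ['Allows Users', 'Blockchain Technology', 'Central Banks']
```

Reading the graph in the other direction (`topic_to_snippets`) gives the number of snippets attached to each topic, which is one simple way to rank topics by significance.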

FIGURE 4.2: Topic Modeling Explainer




     Sentiment Analysis

     Sentiment analysis is the use of natural language processing, text analysis,
     computational linguistics, and biometrics to systematically identify, extract,
     quantify, and study affective states and subjective information. Open source
     software tools, as well as a range of free and paid sentiment analysis tools
     such as RoBERTa, Google Cloud translation, and BERT automate sentiment
     analysis on large collections of texts, including web pages, online news, and
     blogs. Sentiment analysis is widely used at ITSTI for analyzing internal
     documents, risk management, feedback review, online and social media data,
     and so on. Pretrained models trained on different datasets have different
     capabilities and strengths, so sentiment models should be selected based on
     the specific business demands and the available data. After exploring three
     sentiment analysis models for financial data in this prototype, we determined
     that the FinBERT model, which focuses on financial data, produces better results.

     TABLE 4.1: Sentiment Analysis Models

                      Brandwatch                FinBERT                      Twitter roBERTa

      Description     Multilingual sentiment    Pre-trained NLP model to     roBERTa-base model trained
                      model                     analyze sentiment of         on ~58M tweets and fine-tuned
                                                financial text               for sentiment analysis with
                                                                             the TweetEval benchmark

      Approach and    Hybrid approach to        Narrow focus on financial    Effective at picking up
      strengths       sentiment analysis:       data                         colloquial talk
                      Knowledge-Based -> ML
                      -> Custom Rules




     Chinese Translation

     AI translation is a machine translation process based on complex, deep learning
     algorithms. Using intelligent behavior, it can understand a source text and
     generate another text in a different language.

     Translation is required in order to build a more robust tool that covers other
     languages. Since Chinese is widely used in FinTech-related data in Asia, during
     the Syntasa engagement we applied both Simplified Chinese and Traditional
     Chinese themes and keywords to collect media data in Chinese. We then tested
     different translation services on snippets, and compared the quality of Google
     Translation with Hugging Face. The results show that Google Translation
     performs better than Hugging Face in terms of the completeness and accuracy
     of the content: it is more comprehensive, it handles long and complex texts,
     and it translates some key verbs more accurately. Hugging Face often misses
     some content, especially in long sentences, and it fails to recognize many
     professional terms and proper nouns, such as brand names (for example, Moutai).
     For short sentences, however, Hugging Face is concise and accurate; it is no
     worse than Google, and sometimes it is even better.



Business Intelligence Tool: Looker

Looker by Google is a business intelligence (BI) and data analytics platform
comparable to Microsoft Power BI. This web-based tool offers a broad set of
analytics capabilities that businesses can use to explore, discover, visualize,
and share analysis and insights. Looker earns good marks for reporting
granularity and scheduling, its drag-and-drop interface, and its prebuilt
templates and data models. Looker has more colorful UI graph options and a
customizable layout size. It makes it easy to visualize results and build
enterprise-level products such as dashboards and websites. However, since Looker
was integrated into Google’s system just a few years ago, it has limited AI and
statistical functions. Its price is also higher than that of Power BI.




     Business Learnings and Outcome
     This section describes how the dashboard can be useful for finance and
     technology users.


     Key Learnings (Technology)

     1	 Significance of input data: Input data is the foundation of any solution that
        aspires to use emerging technologies like artificial intelligence. Therefore,
        it is important to ensure that the data used to train AI models is accurate,
        representative, and sufficient in quantity.

     2	 Explainability and transparency: As AI models become more complex, it is
        important to ensure that they are explainable and transparent. The decision-
        making process of the model should be easily understood and verified by
        humans concerning which data is relevant; what data can be categorized
        into which theme/keyword; what data to exclude, and so on. Explainability
        and transparency can also help to build trust in the solution.

     3	 Continuous technology learning and improvement: One of the significant
        advantages of AI is that it can improve over time; but this requires
        continuous feedback and training. It is important to continuously monitor
        and evaluate the performance of AI models, and update them to ensure the
        relevance of results over time.




     Key Learnings (Project)

     1	 Clear base requirements: It is important to have clear and well-defined
        requirements for such a PoV. This will help to ensure that everyone involved
        in the project is on the same page and has a common understanding of what
        needs to be achieved. Technical scoping sessions are relevant steps in the
        process of streamlining project requirements, and ensuring their alignment
        with the relevant business needs. ITSTI, along with TREFT and Syntasa, will
        set up dedicated scoping sessions at project initiation to clarify the basic
        project requirements.



2	 Stakeholder engagement through collaboration and expertise: It is
   important to involve relevant stakeholders throughout project engagement,
   from the ideation phase through to scoping and development. TREFT
   has performed the role of the business user collaborating with ITSTI to
   finalize the business and technical requirements, and has collaborated with
   Syntasa as the developer of the solution.

3	 Agile approach: An agile approach toward this project enabled the solution
   to be developed as close to the relevant business needs as possible. Given
   the possibility of showcasing a key functionality during the engagement and
   its alignment with the business needs, the project teams tested the PDF
   Summarizer function in lieu of API integration.

4	 Testing and quality assurance: Continuous manual testing and analysis
   of parts of output data at different stages of the engagement has helped
   to maintain business relevance and ensure quality assurance. During this
   engagement, manual testing was especially important in areas related to
   topical relevance, quality of translation, and user interface. This helped to
   prevent issues and ensure that the solution is reliable and effective.




Key Business Outcomes:

1	 Efficiency. The ability to intelligently source relevant FinTech news by
   mimicking human logic, and to present it on a dynamic dashboard powered
   by Google’s Looker platform contributes to streamlining the tedious
   news-sourcing process, and reveals detailed insights on digital trends,
   and sentiment on the topics. Such a solution could help to save time and
   resources that would otherwise be spent sourcing important news manually.
   It could also reduce human error in identifying news sources that are
   potentially biased or irrelevant, as well as gather relevant news sources that
   a human might miss due to the massive volume of news data on the internet.

2	 Scalability. The consolidation and representation of large volumes of data
   on a dynamic dashboard such as Looker allows the user to customize
   search criteria based on user needs, and categorize data by drawing its
   relation to topical areas of interest. The functionality of reviewing market
   sentiments across multiple topics presents interesting insights that can be
   used as inputs in creating briefing notes, resources, knowledge material,
   slide decks, and reports for senior management review and the wider
   TRE audience.




     3	 Relevance. Ultimately, this solution can also allow treasury staff to stay
        fully involved in and informed about the most relevant happenings within
        the topics of interest, enabling the organization to potentially capitalize
        on key opportunities for innovation within this space, and leverage these
        technologies to improve TRE operations.



     The applicability of the solution to other use cases is another opportunity.
     Currently this solution captures news and material on a specific list of topics,
     and captures them from specific sources as defined by the project team. There
     is a possibility of changing the list of topics and sources, thus indicating the
     potential universality of the base solution (with customized features) across
     various use cases.




                          Considerations for Production Solution

     •	   Chatbot integration/plug in (BARD AI or ChatGPT): A solution that could
          enable the user to source the relevant information by conversing with
          a chatbot.
     •	   Language translation: A solution that could capture resources and
          materials in a multilingual setting, thus increasing the geographical reach
          and revealing more significant results.
     •	   PDF summarizer: A solution where large text files/PDFs are converted into
          an easily understandable and brief summary, with suggestions for how it
          could increase convenience for users.
     •	   Expand scope to test intelligence: A solution where the input data is more
          broadly categorized, and the output data is expected to be even more
          specific and filtered.




Appendices
     APPENDIX A


     Narrative Dashboard Features
     A templated narrative dashboard was deployed to shorten development
     time, limiting the scope to first insights. Although each dashboard was then
     customized to best meet the requirements set forth in this POC, they share
     many of the same features:


     Filters

     To facilitate noise mitigation and focused exploration, two types of filters are
     included on the dashboard: cross-chart filters, and top-level filters. Cross-chart
     filtering enables users to interact with most of the elements on the dashboard.
     For example, on the topics table, if a specific topic is selected, the dashboard
     will filter all of the charts that are based on the selected topic.

     Top-level filters appear at the top of the dashboard and provide extensive
     filtering capabilities. They allow for the selection of a specific date range
     and time series chart granularity, and for the inclusion or exclusion of any
     combination of themes, topics, phrases, companion phrases, types of mention
     (unique vs. repeat), page type, domain, author, and/or language.
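The include/exclude behavior of the top-level filters can be sketched as a small function over a list of mention records. The record fields, the sample values, and the function name are assumptions for illustration; the dashboard itself applies these filters through Looker rather than through code like this.

```python
from datetime import date

def apply_filters(mentions, start=None, end=None, include=None, exclude=None):
    """Keep mentions in the date range that match every include rule and no exclude rule."""
    include, exclude = include or {}, exclude or {}
    kept = []
    for m in mentions:
        if start and m["date"] < start:
            continue
        if end and m["date"] > end:
            continue
        # Include rules: the field value must be one of the allowed values.
        if any(m[field] not in allowed for field, allowed in include.items()):
            continue
        # Exclude rules: the field value must not be one of the blocked values.
        if any(m[field] in blocked for field, blocked in exclude.items()):
            continue
        kept.append(m)
    return kept

# Hypothetical mention records.
mentions = [
    {"id": 1, "date": date(2023, 1, 5), "theme": "Web3", "page_type": "news", "language": "en"},
    {"id": 2, "date": date(2023, 1, 20), "theme": "Digital Currency", "page_type": "twitter", "language": "en"},
    {"id": 3, "date": date(2023, 2, 2), "theme": "Digital Currency", "page_type": "news", "language": "zh"},
    {"id": 4, "date": date(2023, 2, 10), "theme": "Asset Tokenization", "page_type": "forums", "language": "en"},
]

selected = apply_filters(
    mentions,
    start=date(2023, 1, 10),
    include={"theme": {"Digital Currency", "Asset Tokenization"}},
    exclude={"language": {"zh"}},
)
print([m["id"] for m in selected])  # [2, 4]
```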



     KPI Scorecards

     KPI measures at the top of the dashboard include the number of sampled
     mentions, modeled mentions, percentage of mentions modeled, calculated net
     sentiment, and oldest and newest mention dates with respect to the applied
     filters, providing a high-level overview.
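A minimal sketch of how such scorecard values could be derived from a filtered list of mentions is shown below. The report does not state the formulas, so everything here is an assumption: a mention counts as modeled when it has a topic assigned, and net sentiment is taken as (positive - negative) / (positive + negative). The field names and sample records are hypothetical.

```python
from datetime import date

def kpi_scorecard(mentions):
    """Compute headline KPIs for an already-filtered list of mentions."""
    sampled = len(mentions)
    modeled = sum(1 for m in mentions if m["topic"] is not None)
    positive = sum(1 for m in mentions if m["sentiment"] == "positive")
    negative = sum(1 for m in mentions if m["sentiment"] == "negative")
    polar = positive + negative
    return {
        "sampled_mentions": sampled,
        "modeled_mentions": modeled,
        "pct_modeled": 100.0 * modeled / sampled if sampled else 0.0,
        # Assumed definition: (positive - negative) / (positive + negative).
        "net_sentiment": (positive - negative) / polar if polar else 0.0,
        "oldest": min((m["date"] for m in mentions), default=None),
        "newest": max((m["date"] for m in mentions), default=None),
    }

# Hypothetical mention records.
mentions = [
    {"date": date(2023, 3, 1), "topic": "cryptocurrency", "sentiment": "positive"},
    {"date": date(2023, 3, 4), "topic": "cryptocurrency", "sentiment": "negative"},
    {"date": date(2023, 3, 2), "topic": "central banks", "sentiment": "positive"},
    {"date": date(2023, 2, 27), "topic": None, "sentiment": "neutral"},
    {"date": date(2023, 3, 9), "topic": "central banks", "sentiment": "positive"},
]

kpis = kpi_scorecard(mentions)
print(kpis["pct_modeled"], kpis["net_sentiment"])  # 80.0 0.5
```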


Volume and Sentiment Time Series

These two visualizations, found underneath the scorecards, show FinTech
sampled and modeled mentions by volume, and net sentiment over time. In
addition to showing how volume and sentiment are changing over time, peaks
and valleys are often indicative of significant events of interest that may warrant
further investigation.



Countries

The dashboards include a table that shows mention volume, percentage, and
sentiment by country, along with a heat map visualization. Through these
features, users are able to understand and compare the level of engagement
and sentiment in various countries.



Themes

To facilitate top-down analysis, a series of tiles provide mention volume
and sentiment by theme; mention volume by theme over time; and mention
sentiment by theme over time. As detailed in Reference Data, the themes were
provided by WBG SMEs. They include Asset Tokenization, Digital Currency,
and Web3, and are consistent across all three narrative dashboards. This is
useful for understanding and comparing proportionality and sentiment across
various known areas of interest. The time series charts visualize changes in
the discussion to help users understand the ebb and flow of engagement and
sentiment for these themes.



Topics

Complementary to the top-down approach of themes, topics can be thought
of as being constructed from the bottom up. Using AI and natural language
processing (NLP), the mentions are analyzed to identify recurring phrases
and are dynamically grouped into topics. For example, the phrases “bitcoin,”
“btc,” and “ethereum” might be categorized under the topic “cryptocurrency.”
The Topic Modeling section provides more information on the topic modeling
implementation.

As with themes, the same series of tiles is provided for topics to show how
prevalent the initial topics of interest are in digital narratives, as well as
additional topics that are emerging from the conversation. Often many of the
collected mentions do not fit inside one of the predefined themes. These tiles
typically surface previously unrecognized topics of discussion that are taking
place outside of the predefined themes, and are likely of interest.



     Phrases & Companion Phrases

     Phrases are identified by the algorithm using parts of speech to select the
     most relevant phrases and words. The algorithm also identifies the companion
     phrases that are used most commonly with each phrase. These tables show
     the most common phrases and companion phrases in FinTech-related posts
     and articles by volume and sentiment. Accompanying word clouds allow for
     visual analysis.

     Phrase volume and sentiment can be compared in order to understand the
     multitude of narratives taking place. One particular phrase can also be selected
     for deep analysis. By reviewing the associated companion phrases, users are
     able to determine the specific subject matter being discussed in relation to
     the broader topic of conversation. For example, when selecting the phrase
     “bitcoin,” the top two companion phrases that appear might be “ethereum” and
     “cardano.” This suggests that mentions that include the phrase “bitcoin” are
     often discussing “ethereum” and “cardano” in relation to bitcoin.
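The companion-phrase idea can be sketched as a co-occurrence count. This is an illustrative simplification: the list of known phrases is supplied by hand here (the actual algorithm selects phrases using parts of speech), and the mentions and function name are hypothetical.

```python
from collections import Counter

def companion_phrases(mentions, phrase, known_phrases, top_n=2):
    """Count which other known phrases appear in mentions that contain `phrase`."""
    counts = Counter()
    for text in mentions:
        lowered = text.lower()
        if phrase in lowered:
            counts.update(p for p in known_phrases if p != phrase and p in lowered)
    return counts.most_common(top_n)

# Hypothetical mentions and phrase list.
mentions = [
    "Bitcoin and Ethereum both rallied today",
    "Is Cardano a better bet than Bitcoin or Ethereum?",
    "Bitcoin ETF approved",
    "Ethereum upgrade ships",
]
known_phrases = ["bitcoin", "ethereum", "cardano", "stablecoin"]

print(companion_phrases(mentions, "bitcoin", known_phrases))
# [('ethereum', 2), ('cardano', 1)]
```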



     Page Type

     Page Type refers to the category of website the mention was found on; that
     is, news, forums, or blogs, as well as large social media platforms like Twitter,
     Facebook, and YouTube. The dashboards include the same series of tiles for
     Page Type as with themes and topics, and provide insights into where the
     discussion is taking place, comparative sentiments, and changes over time.
     Reach Estimate is an additional measure included here to explain which Page
     Type participants are most likely to engage with. (See more in Reference Data.)
     For example, a minority of the mentions may come from Twitter compared to
     mentions in news, suggesting that the bulk of the discussion is happening in
     the news. However, Twitter’s significantly higher reach estimate indicates that
     despite fewer mentions on the platform, significantly more people are likely to
     be exposed to those mentions.
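The contrast between mention volume and reach can be illustrated with a small aggregation by Page Type. The numbers are invented, and summing a per-mention `reach_estimate` field is an assumption made for this sketch rather than a description of how the reach estimate is actually calculated.

```python
from collections import defaultdict

def summarize_by_page_type(mentions):
    """Total the mention count and reach estimate for each page type."""
    summary = defaultdict(lambda: {"mentions": 0, "reach": 0})
    for m in mentions:
        summary[m["page_type"]]["mentions"] += 1
        summary[m["page_type"]]["reach"] += m["reach_estimate"]
    return dict(summary)

# Hypothetical mentions: news dominates by volume, Twitter by reach.
mentions = [
    {"page_type": "news", "reach_estimate": 1000},
    {"page_type": "news", "reach_estimate": 1000},
    {"page_type": "news", "reach_estimate": 1000},
    {"page_type": "twitter", "reach_estimate": 50000},
]

summary = summarize_by_page_type(mentions)
for page_type, totals in summary.items():
    print(page_type, totals)
```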




Domains

A domain tile is included to analyze volume and sentiment. Domain is
the domain name of the website from which the mention originated (for
example, Twitter.com). This table allows the user to understand and compare
engagement and sentiment across domains, or filter mentions to focus analysis
on one or more domains.



Authors

Author is the nickname, user name, or full name of the entity that posted a
mention. The authors table displays the author of a given post or comment, the
domain the content was posted to, the number and net sentiment of mentions
authored, and the author’s reach estimate. Users are able to identify key
participants, their sentiments, and their relative influence on the discussion.



Mention Details

The original text of the mention is displayed in the Mention Details table. This
table reveals the author of the comment, the text of the mention, the originating
domain, and the date it was posted, thus providing users with an expanded
context. A URL link button to the original source of the mention is included to
facilitate in vivo analysis. An impact score for each mention is also included to
help users understand the relative impact a mention is likely to have had in the
discussion, as discussed in Reference Data.




     APPENDIX B


     Reference Data
     Themes and Keywords

     The relevant smart finance keywords in the list were grouped and categorized
     by the World Bank Group, generating a total of three themes of interest, based
     on WBG business use cases: asset tokenization, digital currency, and Web3.
     Keywords ranged in specificity from a particular cryptocurrency such as Bitcoin,
     to more generalized terms, such as digital wallet.
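A minimal sketch of how keywords grouped by theme could be used to tag mentions is shown below. It uses only a handful of keywords from the lists in this appendix, the mentions are invented, and whole-word, case-insensitive matching is an assumption; the actual data collection was configured through Brandwatch queries rather than code like this.

```python
import re
from collections import Counter

# A few keywords per theme, taken from the lists in this appendix.
theme_keywords = {
    "Asset Tokenization": ["Bitcoin", "Ethereum", "Onyx"],
    "Digital Currency": ["Stablecoin", "Digital Wallet", "Fiat currency"],
    "Web3": ["Blockchain", "Metaverse", "Smart contract"],
}

def build_index(theme_keywords):
    """Map each lowercase keyword to its theme and compile one whole-word pattern."""
    index = {kw.lower(): theme
             for theme, keywords in theme_keywords.items() for kw in keywords}
    alternatives = "|".join(re.escape(kw) for kw in index)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return index, pattern

def themes_in(text, index, pattern):
    """Return the set of themes whose keywords occur in the text."""
    return {index[match.lower()] for match in pattern.findall(text)}

index, pattern = build_index(theme_keywords)

# Hypothetical mentions.
mentions = [
    "Stablecoin issuers are moving to the Ethereum blockchain",
    "A new digital wallet supports Bitcoin",
    "The weather is nice",
]
theme_counts = Counter(theme for text in mentions
                       for theme in themes_in(text, index, pattern))
print(dict(theme_counts))
```

A mention can fall under more than one theme, and a mention with no keyword match falls under none.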



     Asset Tokenization

     Asset Tokenization theme contained approximately 21 keywords:

     •   Bitcoin
     •   Circle (USDC)
     •   Cold Wallet/Hot Wallet
     •   Cryptowinter
     •   Ethereum
     •   Fungible tokens
     •   ICO (Initial Coin Offering)
     •   NFT (Non-Fungible Tokens)
     •   Programmable Money
     •   Programmable Payments
     •   Sats/Satoshis
     •   Tether (USDT)
     •   Stellar Development Foundation
     •   Security Tokens Offering (STO)
     •   Digital Assets Platform (DAP)
     •   Onyx
     •   Orion
     •   Digital Promissory Note
     •   Digital Financial Market Infrastructure (DFMI)
     •   carbon tokenization
     •   carbon credits/certificates



Digital Currency

Digital Currency theme contained approximately 24 keywords:

•   Adoption
•   Apple Pay
•   CBDC (Central Bank Digital Currency)
•   DCEP (Digital Currency Electronic Payment) / e-CNY / Digital Yuan
•   Delivery versus Payment (DvP)
•   Digital Assets
•   Digital Dollar
•   Digital Euro
•   Digital Wallet
•   Double Spending
•   Fiat currency
•   Financial inclusion
•   FOMO (Fear of Missing Out)
•   Google Pay
•   Instant Payment
•   MetaPay
•   Public-Private Partnership (PPP)
•   Retail CBDC
•   Stablecoin
•   Wholesale CBDC
•   Ripple
•   Retail Central Bank Digital Currency (or Retail CBDC or rCBDC)
•   Wholesale Central Bank Digital Currency (or Whole CBDC or wCBDC)
•   Atomic settlement




Web3

Web3 theme contained approximately 21 keywords:

•   Blockchain
•   Cryptocurrency
•   dApps (Decentralized Apps)
•   DLT (Distributed Ledger Technology)
•   Ledger
•   Metaverse
•   MiCA (Markets in Crypto-Assets Law)
•   Regulation
•   Smart contract
•   Traditional Finance / TradFi
•   Decentralized Exchange (DEX)
•   Oracle
•   Hyperledger
•   Decentralized Autonomous Organizations (DAOs)
•   Liquidity Pool
•   Market Capitalization (Market Cap)
•   Total Value Locked (TVL)
•   Loss/bankruptcy/fraud/hack
•   Decentralized Finance (DeFi)
•   Interoperability/Interoperable/Bridge
•   Flash Loans




     Chinese Keywords

     The English keywords were later translated to Simplified Chinese and
     Traditional Chinese to facilitate collection and analysis of FinTech-related data
     authored in Chinese and likely originating from individuals and media sources
     closer to the Chinese markets (for example, Singapore). Initial translations
     were made by Syntasa using the Google Translate service. These initial results were
     refined by WBG personnel who are fluent in written Chinese, and familiar with
     relevant cultural references related to smart finance.



     Asset Tokenization (Simplified Chinese)

     Asset Tokenization keywords in Simplified Chinese shown with multiple
     synonyms separated by “/”:

     资产代币化, 比特币, 世可/Circle/比特币Circle/比特币银行/比特币银行Circle,
     USDC, 冷钱包/硬件钱包/离线钱包, 热钱包/软件钱包/线上钱包, 加密寒冬/
     加密货币寒冬, 以太坊, 同质化代币/可替代代币/同质化通证/可替代通证, ICO/
     首次代币发行/首次发行代币/数字货币首次公开募资/数字货币首次公开发行/
     首次币发行, NFT/非同质化代币/非可替代代币/非同质化通证/非可替代通证/
     不可替代代币, 可编程货币/程序化货币, 可编程支付/程序化支付, Sats/Satoshis/
     中本聪, Tether/稳定币Tether, USDT/泰达币/稳定币USDT, 恒星币/ XLM(Stellar)/
     恒星网络/XLM, STO/证券型通证发行/证券化通证发行, 数字资产平台, DAP / DAP
     币, Onyx / Onyx币, Orion / Orion币, 数字本票/数字期票, DFMI/
     数字金融市场基础设施, 碳币



     Digital Currency (Simplified Chinese)

     Digital Currency keywords in Simplified Chinese shown with multiple synonyms
     separated by “/”:

     数字货币, 采用, 苹果支付, CBDC/中央银行数字货币/央行数字货币, DCEP/
     数字货币电子支付/数字货币和电子支付工具/ "DC/EP", 数字人民币/e-CNY,
     货银对付/DVP/券款对付, 数字资产, 数字美元, 数字欧元, 电子数字钱包/数字钱包,
     双重支付/重复花费/双花, 法定货币, 普惠金融/金融包容性, 错失恐惧症/FOMO/
     害怕错过/社交控, 谷歌支付/Google Pay, 即时付款, Meta pay / 脸书支付,
     公私合作制/公共私营合作制/政府和社会资本合作模式/公私伙伴关系/PPP,
     零售央行数字货币/零售CBDC/零售中央银行数字货币/零售型央行数字货币/
     零售型CBDC/rCBDC, 稳定币, 批发央行数字货币/批发CBDC/
     批发中央银行数字货币/批发央行数字货币/批发型CBDC/wCBDC, 瑞波币,
     原子清算/原子结算


Web3 (Simplified Chinese)

Web3 keywords in Simplified Chinese shown with multiple synonyms separated
by “/”:

web3, 区块链, 加密货币/密码货币/加密数字货币/虚拟货币, dApp/
去中心化应用程序/分布式应用程序/去中心化应用/分布式应用, DLT/分布式帐本技术/
分布式记账技术/分布式记账方式, 分布式帐本, 分类帐/分类账簿, 元宇宙,
欧盟加密资产市场监管法案/加密货币监管协议/MiCA, 监管, 智能合约, 传统金融/
TradFi, 去中心化交易所, 价值中介, Hyperledger/超级账本, DAO/去中心化组织/
去中心化自治组织, 流动性池/流动资金池/流动性储备资金, 市值, TVL/总锁定价值/
锁定的总价值, 损失, 破产, 欺诈, 黑客, 去中心化金融/分布式金融/DeFi, 互操作性,
可互操作, Bridge/区块链桥, Interoperab, 闪电贷/Flash Loan



Geographical Locations

WBG provided a list of 34 individual and collective countries of interest grouped
into six geographic regions:

•   North America (US, Canada, Mexico, Bahamas, and Caribbean)
•   South America (Brazil, Ecuador, and Colombia)
•   Europe (European Union, Euro Area, European Economic Area, Ukraine, and
    Russia)
•   Middle East and North Africa (MENA: UAE, Saudi Arabia, Qatar, Israel,
    Turkey)
•   Africa (Central African Republic, Democratic Republic of the Congo, Ghana,
    and South Africa)
•   Asia (China, Hong Kong, India, Kazakhstan, Singapore, South Korea, Taiwan,
    Thailand, Japan, Australia, New Zealand, and Vietnam)



Trusted Domains

The WBG provided a list of 82 organizations of prioritized interest relating to
the predefined themes and keywords, accompanied by their website address
(domain) and grouped into categories by organization type.




     Organization Categories

     Central Bank, Consultancy, Digital Currency Institution, News Sources,
     Financial Services, International Development, Regulatory Body, Research
     Center, Technology Company, and Think Tank.

     These organizations represent a combination of authority figures, key players,
     and news sources participating in the many facets of finance. They are
     considered by WBG to be generally reliable, authentic, and trustworthy sources
     of information that is highly relevant to WBG business interests. As such, the
     collection was labeled Trusted Domains, referring to their website domain for
     the duration of the project. Notably absent are social media platforms, including
     Twitter and Facebook.




54                                          Project SmartFi: Exploring AI/ML for FinTech News
APPENDIX C


Brandwatch
Brandwatch Social Media Listening Platform

Brandwatch is a social media listening and analytics platform that provides
access to a wide range of online data sources including websites, social media
platforms, and news. Brandwatch automates the process of capturing data
from various sources. The platform uses web crawlers to continuously gather
data from millions of websites, including blogs, forums, and news sites. It
gathers news articles from thousands of sources, including major news outlets,
blogs, and online publications. Users also have access to data from all of the
major social media platforms (Facebook, Twitter, Instagram, LinkedIn, YouTube,
and Reddit).

Brandwatch’s query feature is used to build complex queries to retrieve data
that meets specific criteria, using key terms of interest in SQL-like queries
to retrieve relevant data such as mentions of a brand or product, competitor
activities, and industry trends.

Some of the key capabilities of Brandwatch’s query feature include:

Advanced filtering

A wide range of filtering options may be used in a query, allowing users to
narrow down their search results to only the data that is relevant to their
research. Filters can be applied based on a variety of criteria, including time
period, language, country, author, source type, and more. Filters can also
exclude irrelevant data, reducing the amount of noise in the dataset.




     Boolean operators

     Queries also support Boolean operators, such as AND, OR, NOT, and NEAR.
     This enables users to create complex search queries that combine multiple
     search terms and filter criteria.

     Although Brandwatch also provides a range of analytics and visualization tools,
     these capabilities are limited in comparison to those that are easily achievable
     using Syntasa and Google Cloud. Through the Brandwatch API, we’re able to
     take advantage of this automated data capture with comprehensive coverage
     provided in near real-time; this can save time and effort compared to manually
     scraping data from these sources.

     Brandwatch’s mention metadata fields provide a rich set of information that can
     be used to filter, analyze, and visualize social media and online content. Here
     are some of the metadata fields that are available in Brandwatch and commonly
     used in Syntasa’s news and social media narrative solutions.

     Snippet

     Snippet is the excerpt of the mention that best matches the query.

     Page Type

     Page Type describes the kind of website on which the mention was found, in
     human-readable form. For example: “Blogs,” “YouTube,” “Dark Web,” “QQ,”
     “Facebook,” “Tumblr,” “Instagram,” “Forums,” “Twitter,” “VK,” “Review,”
     “Sina Weibo,” “Reddit,” “4Chan,” “LexisNexis Licensed News,” and “News.”

     Impact

     Impact is a Brandwatch metric used to measure the potential impact of an
     author, site, or mention. It has a logarithmic scale between 0–100, normalized
     for the users’ data to help them find what is most interesting for them. The
     impact score takes into account how much potential a mention has to be seen,
     as well as how many times it has been viewed, shared, or retweeted. (A decimal
     from 0–100.)




Reach Estimate

Reach Estimate is a score created by Brandwatch to estimate how many
individuals may have seen a piece of content. It is available for multiple data
sources, and enables the user to compare the reach of content from different
platforms and track development over time. (0, or a positive integer.)

Sentiment

Each mention within a query has a sentiment associated with it. The sentiment
of a mention can be positive, negative, or neutral. Sentiment is assigned
automatically by the system, but can be selected manually if required.
Brandwatch’s sentiment analysis is based on cutting-edge AI research in the
fields of Deep Learning and Natural Language Processing (NLP). Transformer
Architecture Language Models are pretrained on billions of words to develop
a deep general knowledge of over 100 languages before being applied to
sentiment analysis. This offers a more sophisticated understanding of context,
slang, and dialects. These models can detect sentiment indicated by:

•	   Words (including misspelled words), phrases, and sentence structure
•	   Emojis, emoticons, and multiword hashtags
•	   Negation, punctuation, and much more.




     APPENDIX D


     SmartFi – Trusted
     Domains Technical
     Details
     A Brandwatch query was constructed using the three themes and associated
     English keywords mentioned in Reference Data. The location filter was set to
     “worldwide” to enable later geographic analysis in the dashboard. Since the
     keywords were in English, to alleviate the need for additional translation in the
     Syntasa app, the language filter was set to English to ensure that only English
     content is searched and returned.

     Pluralized and wild-card variants of the keywords were included in the query.
     The “NEAR” operator was used to reduce noise created by generic keywords by
     helping to ensure that their presence in the mention occurred alongside other
     themes and keywords of interest. (For more details on Brandwatch features and
     data sources see Appendix C.)

     As the title suggests, the SmartFi - Trusted Domains exploration was primarily
     focused on the WBG list of 82 organizations of prioritized interest relating to
     the predefined themes and keywords. As such, advanced filtering was applied
     in the Brandwatch query to include only the results from those organizations’
     website domains (Trusted Domains).
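
The query structure described above (keyword variants, a proximity constraint for generic terms, and a domain restriction) can be sketched as a small string builder. This is an illustration under stated assumptions, not the actual SmartFi query: the operator spellings (NEAR/5, site:, a trailing * wildcard) follow commonly documented Brandwatch conventions and should be checked against the current Brandwatch query reference, and the keywords and domains shown are placeholders.

```python
def or_group(terms):
    """Join terms with OR, quoting multi-word phrases."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_query(standalone_terms, generic_terms, context_terms, domains, distance=5):
    """Compose a Boolean query string restricted to a list of website domains."""
    standalone = or_group(standalone_terms)
    # Generic terms only count when they occur near a more specific context term.
    proximity = f"({or_group(generic_terms)} NEAR/{distance} {or_group(context_terms)})"
    sites = "(" + " OR ".join(f"site:{d}" for d in domains) + ")"
    return f"({standalone} OR {proximity}) AND {sites}"


# Placeholder inputs for illustration only.
query = build_query(
    standalone_terms=["web3", "smart contract", "smart contracts"],
    generic_terms=["fraud", "hack*"],
    context_terms=["blockchain", "token*"],
    domains=["example.org", "example.com"],
)
print(query)
```

Keeping the generic terms in their own proximity group means a word such as "fraud" only matches when it appears close to a term that ties it to the themes of interest.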

     The final SmartFi - Trusted Domains dataset in Brandwatch consists of
     approximately 1M mentions in English from approximately 274k unique authors
     found worldwide across the 82 Trusted Domains, from January 1, 2018 through
     February 28, 2023. A “mention” refers to a specific instance of a keyword being
     mentioned on social media, news sites, blogs, forums, or any other online
     source that Brandwatch monitors. A mention can be in a tweet, a Facebook
     post, a blog post, a news article, a forum thread, or any other piece of content
     that contains the specified keyword.



Syntasa SmartFi - Trusted Domains App

Ingest

Brandwatch dataset

The ready-made Brandwatch API process included with Syntasa was
configured with the Brandwatch Trusted Domains query ID to ingest the
Brandwatch Trusted Domains query dataset into BigQuery at a 100 percent
sample rate via Brandwatch’s commercial API.

Each mention in the Brandwatch Trusted Domains dataset contains up to
103 associated mention metadata fields depending on the source, type, and
data availability of the mention. These fields include date, author, domain,
page type, sentiment, impact, reach, snippet, and geographical information
(when available).

In addition to the Brandwatch data, the reference data in Appendix B were also
ingested. The reference data were first manually copied into a single Google
Sheet with three tabs: Themes and Keywords, Trusted Domains, and Regions
and Countries. Three Spark processors, one for each tab, use Python code to
access the relevant tab via Google Cloud Storage API and insert its contents
into a BigQuery table.



Process

Noise Filter

Visual analysis in the SmartFi Trusted Domains - Narrative Looker dashboard
of the most recent 30 days of mentions revealed a high number of irrelevant
forum and review mentions originating from the trusted domains that were
categorized as Technology Company. These mentions included knowledge-
base articles, technical support forums, and app store reviews from Amazon,
Microsoft, Google, and Apple.




     Syntasa provides a multitude of ways to implement noise mitigation in the
     pipeline, including predefined processes with filtering parameters, or the
     option to define custom scripts and SQL queries. For demonstration in this
     POC, a Transform process was inserted in the app and an SQL “WHERE”
     clause was added to the filters to exclude the aforementioned mentions:

         WHERE ((categories.Category != "Technology Company") AND
                ((data.pageType != "forum") AND (data.pageType != "review")))
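
To see what the clause above keeps and drops, its predicate can be mirrored in a few lines of Python. This is an illustrative sketch only: the sample rows are invented, and the field values simply follow the clause. Read literally, the clause removes every Technology Company mention and every forum or review mention from any category, which is broader than removing only forum and review mentions from Technology Company domains. A narrower variant is shown alongside for comparison; it is an assumption about a possible alternative, not what the POC used.

```python
def keep_as_written(category: str, page_type: str) -> bool:
    """Mirror of the WHERE clause quoted above."""
    return (
        category != "Technology Company"
        and page_type != "forum"
        and page_type != "review"
    )


def keep_narrow(category: str, page_type: str) -> bool:
    """Alternative (assumption): drop only forum/review mentions from Technology Company domains."""
    return not (category == "Technology Company" and page_type in {"forum", "review"})


# Invented sample rows: (organization category, page type).
rows = [
    ("Technology Company", "forum"),
    ("Technology Company", "news"),
    ("Central Bank", "forum"),
    ("Central Bank", "news"),
]
for category, page_type in rows:
    print(category, page_type, keep_as_written(category, page_type), keep_narrow(category, page_type))
```

The two predicates agree on the first and last rows and differ on the middle two, which is where the choice between a broad and a narrow noise filter matters.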

     Themes

     A BigQuery (BQ) process was used to label themes associated with each
     mention based on matching keywords. Referencing the predefined Themes and
     Keywords, a mention was labeled Asset Tokenization, Digital Currency, or Web3
     if the snippet contained at least one keyword associated with one of these
     themes.
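
The labeling rule can be sketched as follows. This is a minimal illustration in Python rather than the BigQuery SQL used in the app, and the keywords shown are placeholders chosen for the example; the actual lists are those in the reference data. The function name label_themes is hypothetical.

```python
import re

# Placeholder keywords for illustration only; the real lists are in the reference data.
THEME_KEYWORDS = {
    "Asset Tokenization": ["tokenization", "tokenized asset"],
    "Digital Currency": ["cbdc", "stablecoin"],
    "Web3": ["web3", "smart contract", "defi"],
}


def label_themes(snippet: str) -> list[str]:
    """Return every theme that has at least one keyword present in the snippet."""
    text = snippet.lower()
    themes = []
    for theme, keywords in THEME_KEYWORDS.items():
        # Whole-word match with an optional plural "s", so "defi" does not match "definition".
        if any(re.search(r"\b" + re.escape(k) + r"s?\b", text) for k in keywords):
            themes.append(theme)
    return themes


print(label_themes("A new stablecoin relies on smart contracts."))
```

A snippet can carry more than one theme, so the function returns a list rather than a single label.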

     Categories

     An organization category was assigned to each mention using the “Join”
     feature of the same Transform process containing the noise filtration mentioned
     above. Mentions were labeled with one of the ten Organization categories,
     based on a matching originating domain.

     Regions

     A geographic region was assigned to mentions that have an associated country
     provided by Brandwatch. Although this could easily be done using a process
     in the app, for demonstration purposes this was implemented in Looker using
     LookML. Similarly to how organization categories are assigned, the LookML
     references the Geographical Locations to assign one of the six predefined
     regions, based on a matching originating country.
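
The region assignment amounts to a lookup from country to region. The sketch below expresses that lookup in Python rather than LookML, using the regions listed under Geographical Locations. The exact country strings returned by Brandwatch are an assumption here (for example, "US" may appear as a different name or code), so the keys would need to be aligned with the real data.

```python
# Regions and members as listed under Geographical Locations.
REGIONS = {
    "North America": ["US", "Canada", "Mexico", "Bahamas", "Caribbean"],
    "South America": ["Brazil", "Ecuador", "Colombia"],
    "Europe": ["European Union", "Euro Area", "European Economic Area", "Ukraine", "Russia"],
    "Middle East and North Africa": ["UAE", "Saudi Arabia", "Qatar", "Israel", "Turkey"],
    "Africa": ["Central African Republic", "Democratic Republic of the Congo", "Ghana", "South Africa"],
    "Asia": ["China", "Hong Kong", "India", "Kazakhstan", "Singapore", "South Korea",
             "Taiwan", "Thailand", "Japan", "Australia", "New Zealand", "Vietnam"],
}

# Invert the mapping once so each lookup is a single dictionary access.
COUNTRY_TO_REGION = {
    country: region for region, countries in REGIONS.items() for country in countries
}


def assign_region(country):
    """Return the region for a country, or None if the country is missing or not listed."""
    if country is None:
        return None
    return COUNTRY_TO_REGION.get(country)


print(assign_region("Ghana"))
```

Mentions without a country, or with a country outside the list, simply receive no region.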

     Topic Modeler, Phrases, and Companion Phrases

     A ready-made Topic Modeler process is used in the app to identify topics,
     phrases, and companion phrases. This process consists of Python code running
     in a Spark processor that applies AI and NLP to analyze each mention, and
     to identify recurring phrases and categorize them into topics. For example,
     the phrases “bitcoin,” “btc,” and “Ethereum” might be categorized under the
     topic “cryptocurrency.” (See Topic Modeling for additional details on Syntasa’s
     implementation.)




The snippet is first cleansed using regular expressions to ensure the snippets
processed by the topic modeler consist of only alphanumeric characters and
spaces. Because the Trusted Domains sources do not include social media, the
parameter to include hashtags in the analysis was disabled. Mentions with
short, nonsensical, and/or unrelated text are automatically discarded by the
topic modeler.

As with observations discussed regarding the noise filter, visual analysis in the
SmartFi Trusted Domains - Narrative Looker dashboard of the most recent 30
days of mentions revealed several irrelevant and/or undesirable topics. A series
of “stop” words were provided to the topic modeler to suppress these: the, this,
an, that, do, these, is, has, have, was, had, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o,
p, q, r, s, t, u, v, w, x, y, z, continue reading, also read, rights reserved, privacy
policy, not be, total views, use cookies.
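
The cleansing and stop-word steps described above could look roughly like the following. This is a sketch under assumptions, not Syntasa's implementation: how the topic modeler applies stop words internally is not documented here, so lowercasing and phrase-first removal are choices made for the example, and only part of the stop list is reproduced.

```python
import re

# Subset of the stop list above; single letters are handled separately below.
STOP_WORDS = {"the", "this", "an", "that", "do", "these", "is", "has", "have", "was", "had"}
STOP_PHRASES = ["continue reading", "also read", "rights reserved", "privacy policy",
                "not be", "total views", "use cookies"]


def cleanse(snippet: str) -> str:
    """Keep only alphanumeric characters and single spaces."""
    text = re.sub(r"[^A-Za-z0-9 ]+", " ", snippet)
    return re.sub(r" +", " ", text).strip()


def remove_stop_terms(text: str) -> str:
    """Drop stop phrases first, then stop words and single letters."""
    lowered = text.lower()
    for phrase in STOP_PHRASES:
        lowered = re.sub(r"\b" + re.escape(phrase) + r"\b", " ", lowered)
    tokens = [
        t for t in lowered.split()
        if t not in STOP_WORDS and not (len(t) == 1 and t.isalpha())
    ]
    return " ".join(tokens)


cleaned = cleanse("Bitcoin's price has risen. Continue reading…")
print(cleaned)
print(remove_stop_terms(cleaned))
```

Removing multiword phrases before single words matters, because a phrase such as "not be" can no longer be recognized once its parts have been filtered individually.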

The resulting BigQuery table is an expanded view consisting of a row for every
unique combination of a topic, phrase, and/or companion phrase associated
with a particular snippet.

Combine

Finally, to facilitate analysis in a Looker dashboard, an SQL query in a BQ
process is used to join the intermediary tables containing the Brandwatch
data, themes, categories, regions, topics, phrases, and companion phrases
into a single, unified table. The unique mention resource ID is referenced in
the LookML to essentially collapse the expanded dataset back to ensure that
each mention and associated metadata are accounted for only once during the
dashboard analysis.
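
The effect of collapsing on the mention resource ID can be shown with a small example. The rows and field names below are invented for illustration; the sketch counts distinct mentions per attribute value, similar in spirit to a count-distinct measure keyed on the resource ID, and does not reproduce the project's actual LookML.

```python
from collections import defaultdict

# Invented rows from an "expanded" table: one row per (mention, topic, phrase) combination.
expanded_rows = [
    {"resource_id": "m1", "sentiment": "positive", "topic": "cryptocurrency", "phrase": "bitcoin"},
    {"resource_id": "m1", "sentiment": "positive", "topic": "cryptocurrency", "phrase": "btc"},
    {"resource_id": "m2", "sentiment": "neutral", "topic": "regulation", "phrase": "mica"},
]


def count_mentions_by(rows, field):
    """Count distinct mentions per value of `field`, ignoring row fan-out."""
    ids = defaultdict(set)
    for row in rows:
        ids[row[field]].add(row["resource_id"])
    return {value: len(mention_ids) for value, mention_ids in ids.items()}


print(count_mentions_by(expanded_rows, "sentiment"))
```

A plain row count would report two positive mentions here, because mention m1 appears once per phrase; counting distinct resource IDs reports one.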



Activate

Initially, one week of Brandwatch data was ingested, processed to ensure that
the pipeline was operating properly, and analyzed in the Looker dashboard to
identify data quality issues such as sources of noise. After updating the noise
filter and stop words, the process was repeated for the most recent 30 days,
and then expanded even further to incorporate mentions from the last five years
(January 1, 2018 to current day) for historical analysis. The last step taken was
to enable a scheduled job to automatically ingest and process new Brandwatch
data once a day to allow continued analysis moving forward.




     APPENDIX E


     SmartFi – Uncertain
     Domains Technical
     Details
     For the SmartFi - Uncertain Domains solution, the SmartFi - Trusted Domains
     Brandwatch query was modified (see SmartFi - Trusted Domains). The same
     keywords, location filter, and language were used. Data sources include
     social media (Twitter, Facebook, Reddit, Tumblr, YouTube), blogs, forums, and
     news websites. However, unlike with the SmartFi - Trusted Domains, which
     focused exclusively on the Trusted Domains, the advanced filtering in the
     SmartFi - Uncertain Domains query was modified to explicitly exclude results
     from Trusted Domains.

     The final SmartFi - Uncertain Domains dataset in Brandwatch consists of about
     83M mentions in English from about 6M unique authors found worldwide from
     December 1, 2022 through February 28, 2023.



     Ingest

     Brandwatch Dataset

     The ready-made Brandwatch API process included with Syntasa was
     configured with the Brandwatch Uncertain Domains query ID to ingest
     the dataset into BigQuery at a sample rate of approximately 1.85 percent,
     the maximum Brandwatch provided given the dataset volume, via
     Brandwatch’s commercial API.
     The metadata fields remain the same as described in the SmartFi - Trusted
     Domains app.
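
As a rough sense of scale, the sample rate can be applied to the dataset size reported above. Both figures are approximate in the source, so the result is an order-of-magnitude estimate rather than a reported count; the helper name estimated_ingested is hypothetical.

```python
def estimated_ingested(total_mentions: float, sample_rate_percent: float) -> int:
    """Approximate number of mentions ingested at a given sample rate."""
    return round(total_mentions * sample_rate_percent / 100)


# About 83M mentions sampled at about 1.85 percent: roughly 1.5 million mentions.
print(estimated_ingested(83_000_000, 1.85))
```

The same arithmetic applied to the Chinese Language figures (about 69K mentions at about 37.5 percent) gives roughly 26K mentions.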


Themes

In addition to the Brandwatch data, the Themes and Keywords were also
ingested, as described in the Trusted Domains app.

Twitter

Full tweet text was retrieved directly from the Twitter API for all tweet IDs
included in the Brandwatch data set through a Spark processor with custom
Python code that leverages off-the-shelf libraries such as Requests, Pandas,
and JSON. The tweet text is then inserted into the Brandwatch data set as the
mention snippet in a second Spark Processor.
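
The second step, replacing the Brandwatch snippet with the full tweet text, can be sketched as below. The retrieval step is omitted because it requires Twitter API credentials; the sketch assumes the retrieved text is already available as a dictionary keyed by tweet ID. The field names (tweet_id, snippet) and sample records are hypothetical.

```python
def insert_full_text(mentions, tweet_text_by_id):
    """Return mentions with the snippet replaced by the full tweet text when available."""
    updated = []
    for mention in mentions:
        full_text = tweet_text_by_id.get(mention.get("tweet_id"))
        # Keep the original snippet when no full text was retrieved (e.g. a deleted tweet).
        updated.append({**mention, "snippet": full_text or mention["snippet"]})
    return updated


# Invented sample records for illustration.
mentions = [
    {"tweet_id": "101", "snippet": "New stablecoin rules are..."},
    {"tweet_id": "102", "snippet": "DeFi hack reported..."},
    {"tweet_id": None, "snippet": "A blog post about web3"},
]
tweet_text_by_id = {"101": "New stablecoin rules are coming into force next year."}

result = insert_full_text(mentions, tweet_text_by_id)
for row in result:
    print(row["snippet"])
```

Falling back to the original snippet keeps non-Twitter mentions and unavailable tweets in the dataset instead of dropping them.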



Process

Themes

As with the SmartFi - Trusted Domains app, the same BQ process was used to
label themes associated with each mention based on matching keywords.

Topic Modeler, Phrases and Companion Phrases

Visual analysis in the SmartFi Uncertain Domains - Narrative Looker dashboard
of the most recent 30 days of mentions revealed several irrelevant and/or
undesirable topics. No noise filter was implemented in the app. However, a
series of stop words were provided to the topic modeler to suppress these: the,
this, an, that, do, these, im, is, has, have, was, had, a, b, c, d, e, f, g, h, i, j, k, l, m,
n, o, p, q, r, s, t, u, v, w, x, y, z, amp, rt, follow, retweet, tweet, quote, comment,
the, a, this, an, that, do, these, im, i, is, has, have, was, had, huh, th, else, did,
http, https

Combine

Finally, to facilitate analysis in a Looker dashboard, an SQL query in a BQ
process is used to join the intermediary tables containing the Brandwatch data,
themes, topics, phrases, and companion phrases into a single unified table. The
unique mention resource ID is referenced in the LookML to essentially collapse
the expanded dataset back to ensure that each mention and the associated
metadata are accounted for only once for dashboard analysis.




     Activate

     Initially, one day of Brandwatch data was ingested, processed, and analyzed
     in the Looker dashboard to ensure that the pipeline was operating properly
     and to identify sources of noise. After updating the stop words,
     the process was repeated for the most recent seven days and then expanded
     even further to incorporate mentions from December 1, 2022 to the current
     day for historical analysis. As with the SmartFi - Trusted Domains, the last step
     taken was to enable a scheduled job to automatically ingest and process new
     Brandwatch data once a day to allow continued analysis moving forward.




APPENDIX F


SmartFi – Chinese
Language Technical
Details
For the SmartFi - Chinese Language solution, the SmartFi - Uncertain Domains
Brandwatch query (itself a modification of the SmartFi - Trusted Domains query)
was modified. The same location filter, Worldwide, was used. However, the
language was limited to Chinese and the Simplified Chinese keywords were used
in place of the English terms. Again, data sources include social media
(Twitter, Facebook, Reddit, Tumblr, YouTube), blogs, forums, and news websites.
Trusted Domains were not excluded.

The final SmartFi - Chinese Language app dataset in Brandwatch consists
of ~69K mentions in Chinese from ~17K unique authors found worldwide on
February 7, 2023.



Ingest

Brandwatch Dataset

The ready-made Brandwatch API process included with Syntasa was
configured with the Brandwatch Chinese Language query ID to ingest the
dataset into BigQuery at a sample rate of approximately 37.5 percent,
the maximum Brandwatch provided given the dataset volume, via Brandwatch’s
commercial API. The metadata fields remain the same as described in the
SmartFi - Trusted Domains app.



     Themes

     Themes and Keywords were also ingested as described in the Trusted
     Domains app.

     Twitter

     As with the SmartFi - Uncertain Domains, the full tweet text was retrieved
     directly from the Twitter API for all tweet IDs included in the Brandwatch data,
     and inserted into the Brandwatch dataset as the mention snippet.



     Process

     Themes, Topics, Phrases and Companion Phrases

     Themes, topics, phrases, and companion phrases were processed in the same
     way as in the SmartFi - Uncertain Domains app.

     Translation

     To facilitate theme assignment and topic modeling, the snippet text was
     translated into English using a ready-made Translate process, which uses a
     pre-trained Opus-MT model available for download on Hugging Face
     (https://huggingface.co/Helsinki-NLP/opus-mt-zh-en). See Chinese Translation
     for additional details on Syntasa’s Chinese-to-English translation
     implementation.
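
For readers who want to try the same model, the following is a minimal sketch using the Hugging Face transformers library. It is not Syntasa's Translate process: batching, truncation, and the batch size are assumptions made for the example, and running it requires the transformers, torch, and sentencepiece packages plus a one-time model download.

```python
def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate_zh_to_en(snippets, batch_size=16):
    """Translate Chinese snippets to English with the Helsinki-NLP Opus-MT model."""
    # Imported lazily so the helper above works without the heavy dependencies installed.
    from transformers import MarianMTModel, MarianTokenizer

    model_name = "Helsinki-NLP/opus-mt-zh-en"
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    translations = []
    for batch in batched(snippets, batch_size):
        # Pad to a common length and truncate overly long snippets.
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        outputs = model.generate(**inputs)
        translations.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return translations


# Example call (downloads the model on first use):
# print(translate_zh_to_en(["区块链和智能合约"]))
```

Translating in batches keeps memory use bounded when a day's worth of snippets is processed in one run.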

     Combine

     To facilitate analysis in a Looker dashboard, the SQL query used to join the
     intermediary tables in the SmartFi - Uncertain Domains was modified to include
     both the original Chinese snippet and the translated-into-English snippet.



     Activate

     Only one day (February 7, 2023) of Brandwatch data was ingested, processed,
     and analyzed in the Looker dashboard, to ensure that the pipeline was
     operating properly and to allow for proper evaluation.



