AI Support for Interpreting ESG Investing Performance

Environmental, Social, and Governance (ESG)-related data has seen a dramatic surge in popularity among investors over the past two decades. This trend is mirrored by a significant increase in corporations providing ESG reporting; in 2023, 93% of Russell 1000 companies published a sustainability report. However, the relationship between sustainability and financial performance remains hotly debated among researchers and practitioners, which makes it difficult for investors to use these new criteria to improve their decision-making.

 

The rapid growth in ESG data creates a new opportunity to better understand this relationship. The increase in available information has coincided with the democratization of Large Language Models (LLMs), which can analyze large sets of data efficiently and systematically. To apply this newly available tool to the challenge, a team of student researchers at NYU Stern, including Shridhar Mehendale, Aaditya Shah, and Siddha Kanthi, under the guidance of Ulrich Atz, Research Fellow with NYU Stern CSB, conducted a study to evaluate the effectiveness of LLMs in conducting systematic reviews of ESG literature. The study, titled "The Efficacy of Large Language Models in Systematic Reviews," compared state-of-the-art LLMs and two custom GPT models trained for this task against traditional, manual review methods in interpreting ESG studies.

Methodology

The researchers evaluated two leading LLMs, Llama 3 8B and GPT-4o, on their ability to classify the relationship between ESG and financial performance in 88 papers published between March 2020 and May 2024. All papers had previously been classified by human reviewers, and the LLM findings were compared with those of the manual evaluation.

Each LLM was tested using nine specific prompts in a 3x3 design, focusing on three main questions:

  1. Does the study conclude a relationship between sustainability and financial performance? (positive, negative, or mixed/neutral)
  2. How is financial success implemented? (market-based, accounting-based, both, or other)
  3. How is sustainability implemented? (ESG, E, S, G, CSR, or other)

Each question had three successive prompts, and the results from each prompt were analyzed for accuracy.
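The comparison against human labels can be illustrated with a minimal Python sketch. It assumes accuracy means the share of papers where the LLM's label matches the human reviewer's label, and it borrows the prompt names 1A through 3C from the Appendix tables. The function names, paper IDs, and labels are hypothetical and are not data or code from the study.

```python
from itertools import product

QUESTIONS = (1, 2, 3)                # the three main questions
PROMPT_VARIANTS = ("A", "B", "C")    # three successive prompts per question


def prompt_ids():
    """Return the nine prompt identifiers of the 3x3 design, e.g. '1A'."""
    return [f"{q}{v}" for q, v in product(QUESTIONS, PROMPT_VARIANTS)]


def accuracy(predicted, reference):
    """Share of papers whose LLM label equals the human reviewer's label.

    Assumption: a paper with no LLM label counts as a miss.
    """
    if not reference:
        raise ValueError("reference labels must not be empty")
    matches = sum(
        1 for paper_id, label in reference.items()
        if predicted.get(paper_id) == label
    )
    return matches / len(reference)


# Made-up labels for four hypothetical papers (not results from the study).
human_labels = {
    "p1": "positive",
    "p2": "mixed/neutral",
    "p3": "negative",
    "p4": "positive",
}
llm_labels_1a = {
    "p1": "positive",
    "p2": "positive",   # disagrees with the human reviewer
    "p3": "negative",
    "p4": "positive",
}

print(prompt_ids())
print(accuracy(llm_labels_1a, human_labels))
```

In this toy example the LLM agrees with the reviewers on three of four papers, so the accuracy for prompt 1A would be 0.75. Repeating the calculation for each of the nine prompt identifiers would give one accuracy figure per prompt and model.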

 

Exhibit 1: The three prompts used for question 1. 

Key Findings

  1. Llama 3 demonstrated the highest accuracy. Llama 3 outperformed base GPT-4o on eight out of nine prompts, despite only being provided with paper abstracts. See Appendix for accuracy data on all prompts and subsets.
  2. Overall accuracy of base LLM analysis ranged from 49% to 85%. In most tests, LLM accuracy increased through successive prompts. For example, the 49% figure was the overall accuracy for just one prompt from question 2; accuracy rose to 67% by the third prompt for that question.
  3. Custom GPT-4o Mini and "Custom GPT" chatbots showed significant improvements over base models. The fine-tuned GPT-4o Mini model outperformed the base LLMs by 28% on average in overall accuracy on prompt 1. The gap in accuracy narrowed on subsequent prompts.
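An accuracy gain can be read as either an absolute or a relative change. The short sketch below applies both readings to the question 2 figures quoted above (49% rising to 67%). The function names are hypothetical, and the sketch does not indicate which reading the study uses for its other figures.

```python
def percentage_point_gain(before_pct, after_pct):
    """Absolute change between two accuracy figures, in percentage points."""
    return after_pct - before_pct


def relative_gain(before_pct, after_pct):
    """Change expressed as a percentage of the starting accuracy."""
    return (after_pct - before_pct) / before_pct * 100


# Question 2 figures quoted in the findings above.
lowest_prompt_accuracy = 49
third_prompt_accuracy = 67

print(percentage_point_gain(lowest_prompt_accuracy, third_prompt_accuracy))  # 18
print(round(relative_gain(lowest_prompt_accuracy, third_prompt_accuracy), 1))  # 36.7
```

The same improvement is 18 percentage points in absolute terms and roughly 36.7% in relative terms, so it helps to state which measure is meant when comparing models.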

Implications and Future Work

This study highlights that while current base model LLMs are not yet reliable enough to fully replace human reviewers, fine-tuned LLMs hold promise in accelerating systematic reviews. Future research should focus on:

  1. Fine-tuning LLMs with larger, diverse datasets
  2. Enhancing LLMs' ability to interpret complex ESG information
  3. Integrating LLMs into systematic review workflows to streamline data extraction, synthesis, and reporting

As ESG investing continues to grow in complexity, AI tools will play a critical role in helping investors and researchers manage the increasing volume of information. However, it will also be essential to distinguish genuine insights from noise to maintain the credibility and utility of ESG analyses, as demonstrated by a study on AI-Powered (Finance) Scholarship. Fine-tuned LLMs, if thoughtfully implemented, could provide the financial industry with valuable tools to better understand the intersection of sustainability and financial performance.

Appendix

Key:

L3: Llama 3 8B

4o: Base GPT-4o

CGPT: Custom GPT

4oM: GPT-4o Mini

4oM FT: Fine-tuned GPT-4o Mini

Table 1: Accuracy of LLMs on Prompts 1A-C

Table 2: Accuracy of LLMs on Prompts 2A-C

Table 3: Accuracy of LLMs on Prompts 3A-C