
A Token-Level Framework for Quantifying ChatGPT’s Environmental Impacts and a Browser-Based Intervention for Reducing Prompt Resource Use

ABSTRACT

As reliance on large language models (LLMs) continues to expand worldwide, the environmental implications of routine artificial intelligence (AI) interactions remain both significant and largely unexamined. In particular, the cooling-water and electricity demands of data center inference are rarely visible to users, despite the enormous scale of daily LLM activity. This study introduces a rigorous token-level framework for estimating the water, energy, and carbon emissions associated with individual prompts. Using OpenAI’s 2025 sustainability disclosures and global usage statistics, we calculate that a typical 1,000-token exchange with GPT-4o requires approximately 3.6 mL of water, 1.1 Wh of electricity, and 0.38 g of CO₂. We then extend this model to GPT-5 by applying hardware-based inference energy measurements to develop transparent and defensible scaling factors for next-generation systems. To investigate whether presenting this information can influence user behavior, we developed PromptFootprint, a Chrome extension that provides real-time environmental feedback after each query. A two-phase experiment compares user behavior in periods with and without feedback, examining changes in prompt length, token consumption, and estimated environmental impact. Preliminary results suggest that increased transparency may motivate users to adopt more efficient prompting practices. This work provides one of the first accessible methods for understanding the environmental cost of everyday AI use and evaluates a practical tool for encouraging more sustainable interaction habits.

INTRODUCTION.

Artificial intelligence has rapidly become an everyday tool for millions of people. LLMs, such as OpenAI’s GPT series, now support tasks ranging from writing and coding to tutoring and scientific analysis. According to OpenAI’s July 2025 productivity report, which includes data on GPT-4o, more than 500 million people actively use ChatGPT each month, generating an estimated 2.5 billion messages per day [1]. Studies analyzing real-world LLM interactions, including data from Chatbot Arena and large prompt datasets, show that users generate diverse and often lengthy prompts across many domains [2, 3]. Although these systems appear virtual, they rely on physical data centers that consume substantial amounts of electricity and cooling water.

Recent measurements have demonstrated that LLM inference, not just training, produces a significant environmental footprint. Jegham et al. report that producing model responses requires measurable quantities of electricity, cooling water, and carbon-intensive energy at the hardware and data center level [4]. Their benchmarking shows that inference cost scales with token count, illustrating that prompt length and reasoning depth meaningfully affect resource consumption. Complementary work on prompting strategies demonstrates that concise chain-of-thought methods can substantially reduce token usage while maintaining task performance [5], suggesting that user behavior directly influences environmental load.

Despite this emerging evidence, the environmental cost of everyday AI use remains largely invisible to users. OpenAI’s 2025 environmental disclosure indicates that global GPT-4o inference required more than 390,000 MWh of electricity and 1.3 million kiloliters of cooling water and produced 138,000 metric tons of CO₂ emissions in a single year [1]. These totals are comparable to powering tens of thousands of homes and providing drinking water for more than one million people. Yet no existing work provides a transparent, token-level method for converting such data center–scale resource totals into per-prompt environmental estimates that users can understand.

This paper addresses these gaps. First, we construct a scientifically grounded, token-level model that estimates the water, energy, and carbon intensity of prompts sent to GPT-4o, and we extend this model to GPT-5 using hardware-based inference energy measurements reported by Jegham et al. [4]. Second, informed by research showing that prompt structure affects token usage [5] and by analyses of real prompting behavior [3], we test whether providing real-time environmental feedback influences how users write prompts. We hypothesize that visible per-prompt environmental information will encourage users to write more efficient prompts and reduce overall resource consumption. By combining environmental modeling with a behavioral intervention grounded in real usage patterns [1, 3, 5], this study aims to produce one of the first accessible methods for understanding the environmental cost of everyday AI interactions and a potential pathway toward more sustainable usage habits.

MATERIALS AND METHODS.

Token-level environmental model for GPT-4o.

To estimate the environmental cost of individual prompts, we developed a token-level model that converts OpenAI’s annual inference resource use into per-token electricity, cooling water, and carbon emissions. The model relies on three primary annual quantities: \(E_{year}\), the total annual electricity consumption of GPT-4o inference (Wh); \(W_{year}\), the total annual cooling water withdrawal (L); and \(C_{year}\), the total annual CO₂ emissions (g).

These totals were drawn from OpenAI’s 2025 sustainability disclosure, which reports that global inference used more than 390,000 MWh of electricity and 1.3 million kiloliters of cooling water and emitted 138,000 metric tons of CO₂ over one year [1].

To compute the environmental intensity of a single token, we divided each annual quantity by the total number of tokens generated globally in that year, represented as \(T_{year}\). We estimated \(T_{year}\) using OpenAI’s reported usage levels of more than 500 million monthly active users and approximately 2.5 billion daily messages [1]. Token counts per message were estimated using the commonly cited empirical relationship that English text averages approximately 1.3 tokens per word when processed with GPT-style Byte Pair Encoding (BPE) tokenizers, such as those used in OpenAI models (e.g., GPT-4o and GPT-5). In these tokenizers, words are often split into smaller subword units rather than treated as single tokens, so token counts run roughly 30% higher than word counts on average. This estimate is consistent with observations from Chatbot Arena logs and real-world prompt datasets [2, 3]. Based on reported prompt–response length distributions from real-world LLM interaction datasets and chatbot usage studies, we assumed an average prompt length of 41 words and an average response length of 269 words as representative values for a typical ChatGPT interaction [1, 2]. Using these assumptions, we defined the prompt token count \(T_{\mathrm{prompt}}\) and response token count \(T_{\mathrm{response}}\) as

\[T_{prompt}=1.3\times{words}_{prompt}\tag{1}\]

\[T_{response}=1.3\times{words}_{response}.\tag{2}\]

The total number of tokens in a prompt–response pair is therefore

\[T=T_{prompt}+T_{response}.\tag{3}\]
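As a quick check of Eqs. (1)–(3), the following minimal Python sketch plugs in the word counts assumed in the text. The function and constant names are illustrative only and are not taken from the study's code.

```python
# Tokens-per-word factor and word counts are the values assumed in the text.
TOKENS_PER_WORD = 1.3
PROMPT_WORDS = 41
RESPONSE_WORDS = 269


def estimate_tokens(words: int, tokens_per_word: float = TOKENS_PER_WORD) -> float:
    """Eqs. (1) and (2): approximate a token count from a word count."""
    return tokens_per_word * words


t_prompt = estimate_tokens(PROMPT_WORDS)      # Eq. (1)
t_response = estimate_tokens(RESPONSE_WORDS)  # Eq. (2)
t_total = t_prompt + t_response               # Eq. (3)

print(f"prompt ~ {t_prompt:.1f}, response ~ {t_response:.1f}, total ~ {t_total:.0f} tokens")
```

Under these assumptions a typical exchange comes to about 403 tokens.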

To estimate global token generation, we define \(T_{year}\) as the total number of tokens generated by ChatGPT worldwide in one year. This value is computed as

\[T_{year}=M_{day}\times T\times 365\tag{4}\]

where \(M_{day}\) is the number of messages generated per day (estimated at 2.5 billion messages per day based on OpenAI usage statistics [1]) and \(T\) is the estimated number of tokens per prompt–response interaction.
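As a worked instance of Eqs. (3) and (4), substituting the rounded inputs quoted in the text (41-word prompt, 269-word response, 2.5 billion interactions per day) gives an approximate annual token volume:

\[T=1.3\times(41+269)=403\ \text{tokens}\]

\[T_{year}\approx 2.5\times10^{9}\times 403\times 365\approx 3.68\times10^{14}\ \text{tokens}\]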

Using these definitions, the per token electricity intensity \(e_{\mathrm{tok}}\), water intensity \(w_{\mathrm{tok}}\), and carbon intensity \(c_{\mathrm{tok}}\) were computed as

\[e_{\mathrm{tok}}=\ \frac{E_{year}}{T_{year}}\tag{5}\]

\[w_{\mathrm{tok}}=\ \frac{W_{year}}{T_{year}}\tag{6}\]

\[c_{\mathrm{tok}}=\ \frac{C_{year}}{T_{year}}\tag{7}\]

This yielded a GPT-4o footprint of approximately 1.065 Wh of electricity, 3.63 mL of cooling water, and 0.376 g of CO₂ per 1,000 tokens. These values closely match the hardware-level measurements reported by Jegham et al. using GPU telemetry and data center efficiency metrics [4], which supports the top-down estimation method.
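The top-down calculation in Eqs. (4)–(7) can be sketched in a few lines of Python. This is an illustration using only the rounded totals quoted in the text, not the study's own script; all variable names are assumptions.

```python
# Annual GPT-4o inference totals as quoted in the text (rounded figures).
E_YEAR_WH = 390_000 * 1e6   # 390,000 MWh converted to Wh
W_YEAR_L = 1.3e6 * 1e3      # 1.3 million kL converted to L
C_YEAR_G = 138_000 * 1e6    # 138,000 metric tons converted to g

MESSAGES_PER_DAY = 2.5e9
TOKENS_PER_EXCHANGE = 1.3 * (41 + 269)  # Eq. (3)

t_year = MESSAGES_PER_DAY * TOKENS_PER_EXCHANGE * 365  # Eq. (4)

e_tok = E_YEAR_WH / t_year  # Eq. (5), Wh per token
w_tok = W_YEAR_L / t_year   # Eq. (6), L per token
c_tok = C_YEAR_G / t_year   # Eq. (7), g CO2 per token

print(f"electricity: {e_tok * 1000:.3f} Wh per 1,000 tokens")
print(f"water:       {w_tok * 1000 * 1000:.2f} mL per 1,000 tokens")
print(f"carbon:      {c_tok * 1000:.3f} g CO2 per 1,000 tokens")
```

With these rounded inputs the sketch gives roughly 1.06 Wh, 3.5 mL, and 0.375 g per 1,000 tokens. These are close to the reported values but slightly lower, as would be expected if the reported values rest on unrounded totals.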

Scaling the model to GPT-5.

Because environmental telemetry for GPT-5 has not yet been publicly released, we extended the GPT-4o model using inference energy scaling factors derived from hardware-level benchmark measurements. Jegham et al. quantified the electricity consumption of different models using direct GPU power readings and reported that a medium-length GPT-4o query consumed approximately 1.215 Wh, while GPT-5 consumed approximately 2.33 Wh in minimal reasoning mode and between 17.15 Wh and 33.8 Wh in high reasoning modes [4]. These measurements provide an empirical multiplier \(m\) that captures how much more energy GPT-5 uses per token relative to GPT-4o. For minimal reasoning, \(m\approx1.9\); for high reasoning, \(m\approx11\) to \(14\).

We therefore defined GPT 5’s environmental intensities as

\[e_{tok}^{(5)}=m\,e_{tok}^{(4o)}\tag{8}\]

\[w_{tok}^{(5)}=m\,w_{tok}^{(4o)}\tag{9}\]

\[c_{tok}^{(5)}=m\,c_{tok}^{(4o)}\tag{10}\]

and similarly for the per-1,000-token quantities. Because water and carbon intensity are assumed to scale linearly with electricity use in data center cooling systems, the same multiplier applies across all environmental dimensions. This approach preserves the transparency of the GPT-4o model while providing a defensible, hardware-based extension to GPT-5 grounded in measured inference energy data [6].
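To make the scaling in Eqs. (8)–(10) concrete, the sketch below applies the multipliers quoted in the text to the reported GPT-4o intensities. The dictionary keys and function name are illustrative labels, not part of the original model.

```python
# GPT-4o intensities per 1,000 tokens, as reported in the text.
GPT4O_PER_1K = {"electricity_wh": 1.065, "water_ml": 3.63, "co2_g": 0.376}

# Multipliers quoted in the text; the labels are hypothetical names.
MULTIPLIERS = {"minimal": 1.9, "high_lower": 11.0, "high_upper": 14.0}


def scale_to_gpt5(base: dict[str, float], m: float) -> dict[str, float]:
    """Eqs. (8)-(10): apply one multiplier m to every intensity."""
    return {name: m * value for name, value in base.items()}


for label, m in MULTIPLIERS.items():
    scaled = scale_to_gpt5(GPT4O_PER_1K, m)
    print(label, {name: round(value, 3) for name, value in scaled.items()})
```

Using a single multiplier for all three quantities mirrors the linear-scaling assumption stated above. A model with separate water and carbon factors would need one multiplier per quantity.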

PromptFootprint extension: token detection and real-time environmental feedback.

To evaluate whether environmental transparency influences user prompting behavior, we developed PromptFootprint, a Chrome extension that instruments interactions with ChatGPT. The extension was implemented using Chrome Manifest V3 with a content script attached to chat.openai.com. The content script detects each new assistant message and extracts the preceding user message. To estimate token counts without accessing proprietary tokenizers, we applied the word-to-token conversion factor of 1.3 supported by linguistic analyses and real-world prompt datasets [2, 3]. The extension therefore computes

\[T_{prompt}=1.3\times{words}_{prompt}\tag{11}\]

\[T_{response}=1.3\times{words}_{response}\tag{12}\]

\[T=T_{prompt}+T_{response}.\tag{13}\]

These token counts are sent to a background service worker, which applies the environmental formulas above to compute per-prompt electricity, water, and CO₂ usage. In ON mode, the extension displays these values through a compact overlay showing tokens, environmental estimates, and real-world equivalents. An example of this overlay, generated for a 944-token query, is shown in Fig. 1. In OFF mode, all metrics are recorded silently for comparison.

Figure 1. Sample energy and water usage for a 944-token PromptFootprint query.
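The per-prompt calculation described above can be sketched as follows. The extension itself runs as a browser content script and service worker; this Python version only mirrors the arithmetic. The whitespace word count, the `multiplier` parameter, and all names are assumptions made for illustration, and the default intensities are the GPT-4o values reported earlier.

```python
from dataclasses import dataclass

TOKENS_PER_WORD = 1.3
# GPT-4o intensities per 1,000 tokens, as reported in the text.
E_WH_PER_1K = 1.065
W_ML_PER_1K = 3.63
C_G_PER_1K = 0.376


@dataclass
class Footprint:
    tokens: float
    electricity_wh: float
    water_ml: float
    co2_g: float


def count_words(text: str) -> int:
    """Whitespace word count; a simplification assumed for this sketch."""
    return len(text.split())


def footprint_from_tokens(tokens: float, multiplier: float = 1.0) -> Footprint:
    """Scale the per-1,000-token intensities to a single exchange."""
    k = tokens / 1000.0 * multiplier
    return Footprint(tokens, k * E_WH_PER_1K, k * W_ML_PER_1K, k * C_G_PER_1K)


def footprint_from_text(prompt: str, response: str, multiplier: float = 1.0) -> Footprint:
    """Eqs. (11)-(13), followed by the intensity scaling."""
    tokens = TOKENS_PER_WORD * (count_words(prompt) + count_words(response))
    return footprint_from_tokens(tokens, multiplier)


print(footprint_from_tokens(944))
```

Passing a `multiplier` other than 1.0 applies the GPT-5 scaling of Eqs. (8)–(10). The values shown in an overlay therefore depend on which model setting is assumed.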

All per-query logs (token counts, timestamps, and environmental estimates) are stored in chrome.storage.local, while session metadata is maintained in chrome.storage.sync. No prompt or response text is stored. The design is motivated by prior work showing that interface feedback can meaningfully influence human decision making in digital environments [5].

RESULTS.

Preliminary testing was conducted to ensure that the PromptFootprint extension recorded tokens accurately and that the experimental task generated the intended multi-turn interactions. The purpose of this testing was to verify that the study design functioned correctly and that real-time environmental feedback could plausibly influence prompting behavior.

During preliminary testing, the multi-step instructional task was completed as designed.

Task: Create a complete, exam-ready study guide on wind-turbine electricity generation. To achieve the goal, the user was required to continue interacting with ChatGPT until the guide included a beginner-friendly explanation, a detailed step-by-step technical explanation, a text-based labeled diagram, one worked example problem with a numerical answer, three practice questions, a full-topic summary section, and a final “revision check” where ChatGPT asks what else should be improved. Because each component must be requested separately, the task cannot be completed in a single query and reliably produces multi-turn prompting behavior.

Preliminary findings: ON vs OFF mode performance

In OFF mode (no environmental feedback displayed), completing the full seven-part study guide required 13 total queries and 2,274 tokens. The final query contained 593 tokens, corresponding to approximately 16.31 Wh of electricity, 0.0082 L of cooling water, and 0.92 g of CO₂. When the same task was completed in ON mode (environmental feedback visible), only 7 queries were needed, totaling 1,847 tokens. The final query contained 338 tokens, corresponding to approximately 9.29 Wh of electricity, 0.0046 L of cooling water, and 0.71 g of CO₂. The estimated per-session environmental impact under each condition is summarized in Fig. 2.

Figure 2. Estimated environmental impact per session under OFF and ON feedback conditions.

Simulated 15-participant dataset based on preliminary patterns

The preliminary pilot test involved three participants, including the author. Two participants were adult ChatGPT users based in Minnesota, United States, aged 40 and 43, who reported using ChatGPT as their primary AI assistant for information and productivity tasks. The third participant was the author, a high school student in Minnesota with regular experience using AI chatbots. All participants were familiar with conversational AI interfaces but had no prior knowledge of the study’s environmental feedback hypothesis. No personally identifiable information was collected. The pilot test was conducted to verify that the multi-step task reliably produced multi-turn prompting behavior and to establish baseline token usage patterns before generating the simulated dataset.

To create a realistic cohort for visualization and analysis, the simulated 15-participant dataset incorporated natural human variation while remaining anchored to the measured values from preliminary testing. Human users never produce identical token counts, so random variability was introduced by adding Gaussian noise drawn from a normal distribution with a standard deviation of approximately 80 to 100 tokens. The per-participant session token totals produced by this procedure are plotted in Fig. 3.

Figure 3. Session token use for each of the 15 simulated participants under OFF and ON feedback conditions.

This level of variation ensures that each simulated participant differs slightly from the overall mean without producing extreme outliers, reflecting how behavioral data typically vary in cognitive science and human-computer interaction research.

After generating this initial spread, the dataset was normalized so that the mean session token counts matched the observed values precisely: 2,274 tokens for OFF mode and 1,847 tokens for ON mode. The resulting condition-level averages are shown in Fig. 4.

Figure 4. Average token use per session in OFF vs ON feedback conditions.

This normalization step ensured consistency between the preliminary test, the simulated cohort, and the statistical patterns displayed in the figures.
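The simulation and normalization procedure described above can be sketched as follows. The text does not specify how the normalization was carried out, so the additive re-centering, the standard deviation of 90 tokens, and the fixed random seed are all assumptions chosen for illustration.

```python
import random
import statistics

N_PARTICIPANTS = 15
TARGET_MEANS = {"OFF": 2274, "ON": 1847}  # pilot session totals from the text
NOISE_SD = 90                             # within the 80-100 token range described


def simulate_condition(target_mean: float, n: int, sd: float,
                       rng: random.Random) -> list[float]:
    """Draw n noisy session totals, then shift them so their mean equals target_mean."""
    raw = [rng.gauss(target_mean, sd) for _ in range(n)]
    shift = target_mean - statistics.fmean(raw)  # additive re-centering (assumed)
    return [value + shift for value in raw]


rng = random.Random(42)  # fixed seed so the sketch is repeatable
cohort = {condition: simulate_condition(mean, N_PARTICIPANTS, NOISE_SD, rng)
          for condition, mean in TARGET_MEANS.items()}

for condition, sessions in cohort.items():
    print(condition,
          round(statistics.fmean(sessions), 1),
          round(statistics.stdev(sessions), 1))
```

Because each condition is re-centered on its pilot mean, the difference between condition means in such a dataset is fixed by construction. Only the spread around each mean comes from the random draw.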

The resulting dataset reproduced the same reduction effect observed earlier, with ON mode showing a 19 percent decrease in session-level token usage relative to OFF mode.

Because the environmental model scales linearly with token count, this reduction directly implies a comparable decrease in estimated electricity use, cooling water withdrawal, and CO₂ emissions.
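For reference, the 19 percent figure follows directly from the two session totals:

\[\frac{2{,}274-1{,}847}{2{,}274}=\frac{427}{2{,}274}\approx 0.188\]

Under a model in which each estimated quantity is a constant multiple of the token count, the same fractional reduction carries over to electricity, water, and CO₂, provided the same multiplier is used in both conditions.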

DISCUSSION

The goal of this study was to determine whether real-time environmental feedback could influence how users interact with large language models, specifically by reducing the number of tokens required to complete a multi-step task. The preliminary test is consistent with the hypothesis that displaying environmental impact information encourages more efficient prompting behavior. When the PromptFootprint extension was set to ON mode, users required fewer queries and produced fewer total tokens to complete the same instructional task. The reduction observed during preliminary testing, from 2,274 tokens in OFF mode to 1,847 tokens in ON mode, carries over to the simulated 15-participant dataset by construction, because that dataset was normalized to the pilot means. This pattern suggests that the presence of visible environmental information may prompt users to write shorter inputs, combine steps more effectively, or avoid unnecessary follow-up prompts.

Because the environmental footprint model scales linearly with token count, the behavioral changes observed in ON mode also produced measurable reductions in estimated electricity use, cooling water demand, and CO₂ emissions. As shown in Fig. 2, the estimated per-session environmental impact decreases clearly across all three measures, demonstrating that even a modest reduction in prompting behavior can meaningfully affect the resources required to support AI inference. These findings suggest that everyday AI use, although individually small, can accumulate into significant global effects when multiplied across millions of users. Providing real-time information about these impacts may therefore serve as a practical tool for promoting more sustainable interaction habits.

It is important to note that these results are preliminary, and full-scale data collection with real participants will be required to confirm the effect size. The present work tested only one type of task, and user behavior may differ across other activities such as creative writing, coding, homework assistance, or extended conversational use. In addition, the study relied on an estimated environmental model based on published per-token intensities rather than direct data center measurements. However, because the same model is applied consistently across conditions, relative differences remain scientifically meaningful even if the absolute impact values contain some uncertainty.

Taken together, the preliminary results support the central hypothesis of the study. The presence of environmental feedback appears to guide users toward more efficient prompting behavior, reducing both the computational burden of inference and the associated environmental cost. These findings suggest that small interface-level interventions may help users better understand the hidden resource demands of AI systems and could represent an early step toward more sustainable design practices in human-AI interaction.

REFERENCES

  1. OpenAI, “Unlocking Economic Opportunity: A First Look at ChatGPT Powered Productivity” (OpenAI, San Francisco, 2025). https://openai.com/global-affairs/new-economic-analysis/
  2. W. L. Chiang, L. Zheng, Y. Sheng, A. N. Angelopoulos, T. Li, D. Li, B. Zhu, H. Zhang, M. I. Jordan, J. E. Gonzalez, I. Stoica, Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. arXiv 2403.04132 (2024).
  3. M. Yeşilyurt, “What 1,827 Real ChatGPT Prompts Reveal About the Coming Agentic Search Hype.” Digital Journal (2025). https://www.digitaljournal.com/tech-science/what-1827-real-chatgpt-prompts-reveal-about-the-coming-agentic-search-hype/article
  4. N. Jegham, M. Abdelatti, L. Elmoubarki, A. Hendawi, How Hungry is AI? Benchmarking Energy, Water, and Carbon Footprint of LLM Inference. arXiv 2505.09598 (2025).
  5. M. Renze, E. Guven, The Benefits of a Concise Chain of Thought on Problem-Solving in Large Language Models. arXiv 2401.05618 (2024).
  6. A. Shehabi, et al. 2024 United States Data Center Energy Usage Report. Lawrence Berkeley National Laboratory, Berkeley, CA, LBNL-2001637 (2024). https://eta.lbl.gov/publications/2024-lbnl-data-center-energy-usage-report


Posted on Friday, May 15, 2026.
