The Shifting Drivers of Residential Energy Tax Credit Usage: An Ensemble Machine Learning Approach
ABSTRACT
By analyzing the relationship between income, educational attainment, age, and usage of residential energy tax credits, this study examines how demographic factors beyond income drive adoption of energy-efficient technologies. Data from the IRS’s Statistics of Income program were combined with demographic data from the American Community Survey, and a Random Forest Regressor and a K-Means clustering algorithm were used to characterize adoption across demographic groups. The results show that, between 2011 and 2022, income became a less important predictor of energy credit usage, while the importance of educational attainment rose by roughly 70%. Furthermore, a cross-sectional analysis of 2022 data reveals that within middle-income communities, zip codes with mid-level educational attainment claim significantly larger credit amounts than their less-educated peers, suggesting a divergence in technology adoption. These findings suggest that current incentives favor administratively literate households and point to policy reforms such as point-of-sale rebates to broaden access to the green energy transition.
INTRODUCTION.
For over a decade, the United States federal government has used the tax code to address the negative externalities of residential energy consumption. Through mechanisms such as Section 25D (for solar, geothermal, and wind) and Section 25C (for energy efficiency improvements), policy has aimed to lower the effective upfront cost of decarbonization technologies [1]. The theoretical basis for these credits is the Pigouvian subsidy: by subsidizing the private cost of adoption, the government hopes to align private incentives with the social benefit of reduced carbon emissions. However, the efficacy of a non-refundable tax credit depends heavily on a taxpayer’s ability to cover the initial capital cost, as well as on their knowledge of the technology’s efficacy.
Historically, the distribution of these credits has been regressive. Research by Borenstein and Davis (2016) demonstrated that the top income quintile received the vast majority of federal energy tax credits, a phenomenon attributed to the high capital requirements of renewable technology and the non-refundable nature of the incentives [2]. However, the economic landscape of renewable energy has shifted drastically since the inception of these credits. As the cost of solar photovoltaics (PV) has fallen by more than 85% since 2010 [3], the absolute financial barrier has lowered, theoretically opening the market to a broader demographic. Yet, adoption remains uneven across geographic and demographic lines, suggesting that factors beyond simple income are influencing participation [2].
While existing literature has extensively documented income disparities, the influence of non-monetary demographic factors, particularly educational attainment, remains under-investigated. Furthermore, few studies have analyzed the temporal stability of these relationships to determine whether the drivers of adoption have evolved as technology costs have plummeted [4]. Understanding these shifting drivers matters for two reasons: it helps ensure that the benefits of a green economy are shared by all segments of society, and it informs the design of future policy interventions that look beyond financial subsidies to address administrative or educational barriers.
The purpose of this study is to analyze a longitudinal dataset merging Internal Revenue Service (IRS) Statistics of Income data with American Community Survey (ACS) census data from 2011 to 2022 to isolate the impact of educational attainment on adoption rates [5, 6]. We hypothesized that, as the market maturity and affordability of renewable energy technologies increased over the last decade, the predictive importance of household income for tax credit adoption would decline while the importance of educational attainment would increase. The reasoning is that as financial barriers recede, the primary obstacles to adoption shift toward the administrative complexity of filing tax forms and the knowledge required to understand the advantages of adopting new systems [7].
MATERIALS AND METHODS.
This analysis relies on a combined dataset constructed by merging two data sources at the Zip Code Tabulation Area (ZCTA) level.
The first is the IRS Statistics of Income (SOI) program. Annual “Individual Income Tax Statistics by Zip Code” files were used for tax years 2011 through 2022. To keep one observation per zip code, the rows broken out by Adjusted Gross Income (AGI) bracket were excluded, retaining only the aggregate totals for each zip code. Key variables extracted included N1 (Total Returns), A00100 (Adjusted Gross Income), N07260 (Number of Residential Energy Credit claims), A07260 (Amount of Residential Energy Credit), N18500 (Number of returns reporting Real Estate Taxes), and ELDERLY (Returns claiming age 65+ exemptions). All financial figures were normalized by the number of returns (N1) to create per-return rates, so that differences in zip code size do not skew the model.
The second is the U.S. Census Bureau’s American Community Survey (ACS) 5-Year Estimates. Two demographic variables were extracted: median age and percent Bachelor’s (the percentage of the population aged 25 and older holding a Bachelor’s degree or higher).
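The merge and normalization described above can be sketched as follows. This is a minimal illustration rather than the study’s actual code: the sample values are invented, the output field names are hypothetical, and it assumes that SOI dollar amounts are reported in thousands of dollars and that the adoption rate is the number of claims (N07260) divided by total returns (N1).

```python
# Hypothetical SOI rows (aggregate totals per zip code); values are invented.
soi_rows = [
    {"zipcode": "00001", "N1": 20000, "A00100": 3_000_000, "N07260": 500, "A07260": 1200},
    {"zipcode": "00002", "N1": 15000, "A00100": 1_500_000, "N07260": 150, "A07260": 250},
    {"zipcode": "00003", "N1": 8000, "A00100": 400_000, "N07260": 40, "A07260": 30},
]

# Hypothetical ACS rows keyed by zip code; "00003" is deliberately missing.
acs_by_zip = {
    "00001": {"median_age": 41.2, "pct_bachelors": 62.0},
    "00002": {"median_age": 37.5, "pct_bachelors": 48.0},
}


def build_features(soi_rows, acs_by_zip):
    """Inner-join SOI and ACS records by zip code and compute per-return rates."""
    features = []
    for row in soi_rows:
        acs = acs_by_zip.get(row["zipcode"])
        # Skip zip codes without ACS data, without returns, or without claims.
        if acs is None or row["N1"] == 0 or row["N07260"] == 0:
            continue
        features.append({
            "zipcode": row["zipcode"],
            # Assumed definition: share of returns claiming the credit, in percent.
            "adoption_rate_pct": 100 * row["N07260"] / row["N1"],
            # Assumption: amounts are in thousands of dollars.
            "avg_income": 1000 * row["A00100"] / row["N1"],
            "avg_credit_size": 1000 * row["A07260"] / row["N07260"],
            "median_age": acs["median_age"],
            "pct_bachelors": acs["pct_bachelors"],
        })
    return features


features = build_features(soi_rows, acs_by_zip)
for f in features:
    print(f["zipcode"], f["adoption_rate_pct"], f["avg_income"], f["avg_credit_size"])
```

Dividing by the number of returns rather than by population keeps the unit of observation consistent with the tax data, which is why the resulting figures are per-return rates.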
Random Forest Regression
To quantify the shifting influence of demographic factors over time, a separate Random Forest Regressor was trained for each year from 2011 to 2022. The Random Forest algorithm is an ensemble learning method that constructs multiple decision trees during training. For regression tasks, it outputs the mean prediction of the individual trees. We selected this model over standard Ordinary Least Squares regression because demographic variables (like age and income) often exhibit non-linear relationships with adoption. For example, adoption may rise with income up to a saturation point and then plateau, a pattern that a linear model without additional terms cannot capture.
For each year, the model was trained to predict the adoption rate from a feature vector of average income, median age, and percent Bachelor’s. The model used 100 estimators (trees) with a maximum depth of 12 to limit overfitting.
The primary output metric was Feature Importance. In a Random Forest, importance is calculated by measuring how much the tree nodes that use a specific feature reduce the variance of the target variable (adoption rate) across all trees in the forest. By plotting these importance scores over the 12-year period, we visualized the structural shifts in what drives adoption.
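A minimal sketch of this step using scikit-learn is shown below. It assumes scikit-learn’s RandomForestRegressor and its impurity-based `feature_importances_` attribute; the data are synthetic, and the function name, feature labels, and random seed are hypothetical choices made for illustration, not taken from the study’s code.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

FEATURES = ["avg_income", "median_age", "pct_bachelors"]


def yearly_feature_importance(X, y, seed=0):
    """Fit one forest (100 trees, max depth 12) and return importance per feature."""
    model = RandomForestRegressor(n_estimators=100, max_depth=12, random_state=seed)
    model.fit(X, y)
    # Impurity-based importances; scikit-learn normalizes them to sum to 1.
    return dict(zip(FEATURES, model.feature_importances_))


# Synthetic stand-in for one tax year of zip-code data (invented, not real).
rng = np.random.default_rng(0)
n = 500
income = rng.uniform(30_000, 250_000, n)
age = rng.uniform(25, 65, n)
bachelors = rng.uniform(5, 80, n)
X = np.column_stack([income, age, bachelors])

# Adoption saturates with income (non-linear) and rises mildly with education.
y = 2.0 * (1 - np.exp(-income / 80_000)) + 0.01 * bachelors + rng.normal(0, 0.1, n)

importances = yearly_feature_importance(X, y)
for name, score in importances.items():
    print(f"{name}: {score:.2f}")
```

In a setup like the one described, a function of this kind would be called once per tax year and the resulting scores plotted over time.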
Cross-Sectional Analysis: K-Means Clustering
To identify distinct community archetypes in the 2022 tax year, we employed K-Means Clustering, an unsupervised learning algorithm. Five key variables were selected: adoption rate, average income, median age, percent Bachelor’s, and average credit size (total amount claimed / number of claims).
Because these features had different units, a StandardScaler (Z-score normalization) was applied to transform all features to a mean of 0 and a standard deviation of 1. Then, k=3 was set as the number of clusters to identify three broad market segments. The K-means algorithm iteratively assigned zip codes to the nearest centroid to minimize within-cluster variance.
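The scaling and clustering steps can be sketched with scikit-learn as follows. The data are synthetic, and the group centers, spreads, `n_init`, and random seed are assumptions for illustration only; the study’s own parameters beyond k=3 and Z-score scaling are not stated in the text.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Columns: adoption rate (%), average income ($), median age,
# percent Bachelor's (%), average credit size ($). All values are invented.
rng = np.random.default_rng(42)
centers = np.array([
    [1.5, 60_000, 39.0, 20.0, 1_000],
    [3.0, 90_000, 42.0, 35.0, 3_000],
    [2.0, 180_000, 44.0, 60.0, 2_200],
])
spread = np.array([0.2, 5_000, 1.0, 3.0, 150])
X = np.vstack([c + rng.normal(0, spread, size=(100, 5)) for c in centers])

# Z-score normalization so that dollar-valued columns do not dominate distances.
X_scaled = StandardScaler().fit_transform(X)

# k=3 clusters; n_init and random_state are illustrative choices.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)

# Report cluster means in the original units for interpretability.
for k in range(3):
    print(k, np.round(X[labels == k].mean(axis=0), 2))
```

Without scaling, Euclidean distances would be driven almost entirely by income, since it is measured in tens of thousands of dollars while adoption rate spans only a few percentage points.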
Effect of Education Analysis
To test the hypothesized effect of education level, a specific economic subset of the data was isolated: middle-income zip codes, defined as zip codes with an average AGI between $75,000 and $125,000. This group was then segmented into three educational tiers by percent Bachelor’s, and the mean credit size was calculated for each tier. The tiers were labeled Low Edu, Mid Edu, and High Edu, referring to zip codes where <25%, 25-50%, and 50%+ of residents hold Bachelor’s degrees, respectively.
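The tiering logic can be illustrated with a short sketch. The rows are invented, the field names are hypothetical, and the handling of boundary values (income bounds inclusive, exactly 25% counted as Mid Edu, exactly 50% counted as High Edu) is an assumption, since the text does not specify it.

```python
from statistics import mean


def education_tier(pct_bachelors):
    """Map percent Bachelor's to a tier; boundary handling is an assumption."""
    if pct_bachelors < 25:
        return "Low Edu"
    if pct_bachelors < 50:
        return "Mid Edu"
    return "High Edu"


def mean_credit_by_tier(zip_rows, income_low=75_000, income_high=125_000):
    """Keep middle-income zip codes and average credit size within each tier."""
    by_tier = {}
    for row in zip_rows:
        if income_low <= row["avg_income"] <= income_high:
            tier = education_tier(row["pct_bachelors"])
            by_tier.setdefault(tier, []).append(row["avg_credit_size"])
    return {tier: mean(values) for tier, values in by_tier.items()}


# Invented example rows, one per zip code.
zip_rows = [
    {"avg_income": 80_000, "pct_bachelors": 20, "avg_credit_size": 1500},
    {"avg_income": 90_000, "pct_bachelors": 22, "avg_credit_size": 1700},
    {"avg_income": 100_000, "pct_bachelors": 35, "avg_credit_size": 2000},
    {"avg_income": 110_000, "pct_bachelors": 50, "avg_credit_size": 1900},
    {"avg_income": 200_000, "pct_bachelors": 70, "avg_credit_size": 2500},  # excluded
]

tier_means = mean_credit_by_tier(zip_rows)
print(tier_means)
```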
RESULTS.
The Random Forest analysis revealed a marked structural shift in the drivers of adoption over the decade (Figure 1). In 2011, average AGI was the dominant predictor, with a feature importance score of 0.65. By 2022, this score had declined to 0.43. While income remains the strongest single factor, its feature importance has diminished by roughly one third.

The importance of educational attainment rose from 0.17 in 2011 to 0.29 in 2022. This trend suggests that as technology costs decrease, the barrier to adoption is shifting from financial capital to knowledge of the technology. The importance of community age rose from 0.18 in 2011 and stabilized around 0.27, acting as a persistent secondary filter on adoption.
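The relative changes implied by these scores follow from simple arithmetic, sketched below. The function name is hypothetical, and the inputs are the rounded scores reported in the text, so the results are approximate.

```python
def relative_change(start, end):
    """Fractional change from start to end, e.g. -0.25 for a 25% decline."""
    return (end - start) / start


# Rounded feature importance scores reported for 2011 and 2022.
income_change = relative_change(0.65, 0.43)
education_change = relative_change(0.17, 0.29)

print(f"income: {income_change:+.1%}")        # -33.8%
print(f"education: {education_change:+.1%}")  # +70.6%
```

Because a forest’s importance scores sum to 1 within each year, a decline in one feature’s share necessarily appears as a rise elsewhere, so these changes describe relative rather than absolute predictive power.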
The K-Means clustering analysis partitioned US zip codes into three distinct archetypes based on their socioeconomic profiles and adoption behaviors that go beyond income tiers (Figure 2). Cluster 0, representing middle-income households, was the primary driver of the residential energy transition. Despite possessing only moderate levels of income and educational attainment relative to the highest tier, this cluster demonstrated the highest overall adoption rate and the greatest investment magnitude (Table 1).

Table 1. Demographic characteristics of the clusters emerging from K-Means.
| Cluster | Adoption Rate (%) | Average Income ($) | Median Age (years) | Percent Bachelor’s Degree (%) | Average Credit Size ($) |
|---|---|---|---|---|---|
| 2 | 1.7301 | 68,830.431 | 40.5633 | 24.9567 | 1,213.814 |
| 0 | 3.1499 | 83,965.874 | 41.6476 | 30.9284 | 3,241.947 |
| 1 | 2.0806 | 199,267.461 | 42.5977 | 62.2904 | 2,341.559 |
Conversely, Cluster 1, high-income households, revealed a counter-intuitive trend in regard to financial capacity. Though this group is characterized by the highest average adjusted gross income and educational attainment, their participation in the tax credit program was markedly lower than the middle-income cohort. Both the adoption rate and the average credit size for these wealthy communities trailed the middle-income households (Table 1).
Finally, Cluster 2 represents communities facing significant barriers to participation. Characterized by lower average incomes and educational attainment, this group exhibited the lowest adoption rate in the dataset (Table 1). Furthermore, the substantially lower average credit size in this cluster suggests that when these households do participate, their investments are limited to lower-cost maintenance or building improvements rather than large, high-capital projects. This disparity highlights the structural limitations of non-refundable tax credits in reaching households with lower incomes and capital constraints.
To examine how education levels affect investment, the amount of credit claimed (a proxy for the size and complexity of the installation) was compared across education tiers (Figure 3). Among zip codes with average incomes between $75,000 and $125,000, the average credit claimed in High Education zip codes was 9.2% higher than in Low Education zip codes, but this gap was not statistically significant. Interestingly, the Middle Education group claimed the highest average credit, 15.34% greater than the Low Education group and 5.62% greater than the High Education group. The difference between the Middle and Low Education groups was statistically significant at the 95% confidence level.
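The text does not state which significance test was used. As one common option for comparing the mean credit size of two groups with possibly unequal variances, a Welch’s t-test could be sketched as below. This is an assumed technique, not the study’s documented method; the sample values are invented and the function name is hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    # Squared standard errors of each group mean (sample variance / n).
    va = variance(sample_a) / na
    vb = variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Invented average credit sizes ($) for a handful of zip codes per tier.
mid_edu = [2100, 1950, 2200, 2050, 2000, 2150]
low_edu = [1800, 1750, 1900, 1700, 1850, 1760]

t_stat, dof = welch_t(mid_edu, low_edu)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

The t statistic would then be compared against the critical value of the t distribution for the computed degrees of freedom at the chosen significance level (0.05 for a 95% confidence level).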

To further understand how different levels of education affect investment frequency and magnitude, we analyzed the correlation between specific educational milestones and tax credit behavior across all middle-income zip codes (Figure 4). Communities with high densities of Associate’s degrees exhibited the strongest positive correlation with adoption rate (r=0.11) but a negative correlation with credit size (r=-0.16). Conversely, populations with Bachelor’s and Graduate degrees showed a weaker correlation with adoption rate but the strongest positive correlations with investment depth (r>0.20). This result is consistent with the interpretation that vocational-level education drives high-volume, lower-cost participation (likely efficiency upgrades), while more advanced academic credentials are the primary driver of high-capital projects.
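The correlation coefficients reported here are presumably Pearson’s r, although the text does not name the coefficient. Under that assumption, the computation can be sketched as follows; the function name is hypothetical and the sample values are invented.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and root sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sy = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sx * sy)


# Invented values: share of residents with an Associate's degree (%) and
# average credit size ($) for a few middle-income zip codes.
pct_associates = [6.0, 8.5, 7.2, 10.1, 9.3, 5.4]
avg_credit = [2100, 1900, 2050, 1750, 1950, 2000]

r = pearson_r(pct_associates, avg_credit)
print(f"r = {r:.2f}")
```

Values of r near 0.1 to 0.2, as reported above, indicate weak linear associations, so the direction of each relationship is more informative than its magnitude.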

DISCUSSION.
This study examined whether the demographic determinants of residential energy tax credit adoption have shifted as renewable technologies have matured. The hypothesis that the predictive importance of household income would decline while the importance of educational attainment would increase over the 2011–2022 period was supported by the longitudinal analysis. The Random Forest regression modeling indicated that while income remains the primary predictor of adoption, its feature importance score decreased by roughly one third (falling from 0.65 in 2011 to 0.43 in 2022). Concurrently, the feature importance of educational attainment rose by roughly 70%, from 0.17 to 0.29, over the same period. The importance of age also increased, from 0.18 in 2011 to 0.27 in 2022, indicating that a second non-income demographic factor became more important for residential energy investment. These results are consistent with the view that, as the financial barrier to entry lowers due to falling technology costs, the primary friction point for adoption is increasingly related to non-monetary factors associated with education, such as the ability to navigate administrative complexity or a technical understanding of building systems.
The cross-sectional analysis of the 2022 tax year provided specific evidence regarding the nature of this shift. The K-Means clustering revealed that the highest adoption rate (3.15%) and investment magnitude ($3,242) were not found in the wealthiest zip codes (Cluster 1, average income $199,267), but rather in middle-income communities (Cluster 0, average income $83,966). This suggests that, beyond a certain income level, additional financial capacity yields diminishing returns in incentive utilization. Furthermore, when the analysis was restricted to the middle-income band ($75,000–$125,000), a statistically significant disparity in investment depth emerged. Communities with mid-level education (25%–50% Bachelor’s degrees) claimed the highest average credit amount ($2,067), significantly higher than low-education communities ($1,792). Additionally, the correlation analysis (Figure 4) indicated that while Associate’s degrees correlated with higher frequency of adoption, Bachelor’s and Graduate degrees correlated with higher credit sizes (r > 0.20). This implies a divergence in technology adoption: vocationally educated households may be more likely to engage in lower-cost efficiency upgrades, while households with advanced academic credentials are more likely to undertake capital-intensive projects like solar panels.
These findings expand upon the foundational work of Borenstein and Davis (2016), who characterized the early distribution of energy tax credits as highly regressive [2]. While their analysis accurately reflected a market dominated by high upfront costs, this study demonstrates how that market has changed. The decline in income’s predictive power aligns with the reduction in solar photovoltaic costs [3]. However, the rising importance of education suggests that the Pigouvian subsidy model, which assumes rational actors will adopt technologies once the price is right, fails to account for the administrative burden of filing tax forms and the informational burden of vetting complex home infrastructure projects. The effect of administrative burden is well documented, and it shows that even those eligible for credits may not be able to claim them [8]. The tax code appears to favor those with the administrative literacy to navigate it, rather than simply those with the most capital.
Gillingham and Palmer (2014) identify these informational asymmetries as key drivers of the ‘Energy Efficiency Gap,’ where households fail to make profitable investments due to hidden costs [7]. Our results suggest that educational attainment helps lower these additional costs. Households with higher education levels possess a comparative advantage in navigating bureaucratic hurdles [9]. However, this does not fully account for why zip codes in which 25% to 50% of residents hold Bachelor’s degrees invest more than zip codes in which more than 50% do, as seen in Figure 3. A possible explanation is that college graduates are more likely to migrate within the US and are therefore less likely to invest heavily in technology with a long breakeven period, whereas those with lower education levels are more geographically stable [10]. Nonetheless, this finding should be investigated in future studies by examining the demographic characteristics of “middle education” zip codes in more detail.
To address this disparity across educational levels, one possible policy change is a point-of-sale rebate or discount in place of a tax credit claimed after filing. This would increase the salience of residential energy savings, and salience has been shown to affect consumption [11].
A primary limitation of this study is the reliance on aggregated Zip Code Tabulation Area (ZCTA) data, which groups together households with different circumstances; trends observed at the community level may not reflect individual household decision-making. The study also did not account for state and local policy, focusing instead on federal tax incentives.
Future research should also aim to disaggregate the specific types of technologies claimed under Sections 25D and 25C to determine whether educational attainment influences the choice between different types of energy-efficient technology. Additionally, experimental studies could evaluate whether simplifying the claiming process, for example by converting a tax credit to a point-of-sale rebate, would close the adoption gap between low- and high-education communities.
This research is especially relevant as policymakers continue to drive the growth of renewable energy and energy-efficient technologies in the US. Because tax credits are the primary instrument of this policy, studying the distributional effects of their use across the US population is critical. By recognizing that demographics beyond income affect the ability to use tax credits, future policy can be designed to account for these non-monetary barriers, and this research can serve as a justification for targeting low-education zip codes across the US.
CONCLUSION.
Using both a Random Forest and a K-Means machine learning approach to analyze Residential Energy Tax Credits, this study shows a temporal shift in the drivers of energy tax credit usage over the last decade, from largely financial capital toward factors associated with administrative literacy. While income remains the primary predictor, its predictive power has diminished markedly over the last decade, while the importance of educational attainment has grown. However, this relationship is not strictly linear; a distinct gap emerged in which zip codes with mid-level education claimed the highest average credit amounts. Together with the finding that higher degrees are correlated with greater investment amounts, this indicates a divergence in the types of technologies adopted. The study also highlights the limitations of the current Pigouvian subsidy model, which fails to account for the significant administrative burdens that disproportionately affect less-educated households. Further research is needed to distinguish whether these disparities stem from the complexity of the tax code or from other demographic factors. Addressing these non-monetary barriers, potentially by enacting point-of-sale rebates, is vital to ensuring that the benefits of the energy transition are shared by all in society.
ACKNOWLEDGMENTS.
I thank Katelyn Wagner, my economics teacher at Princeton High School, for guidance in the development of this research paper.
SUPPORTING INFORMATION.
The exact code for the models mentioned can be found in the Supporting Information file.
REFERENCES.
1. M. Hymel, The United States’ Experience with Energy-Based Tax Incentives: The Evidence Supporting Tax Incentives for Renewable Energy. Loyola University Chicago Law Journal 38 (2006).
2. S. Borenstein, L. W. Davis, The Distributional Effects of US Clean Energy Tax Credits. Tax Policy and the Economy 30, 191–234 (2016).
3. “Renewable Power Generation Costs in 2020” (International Renewable Energy Agency, 2021).
4. D. Feldman, V. Ramasamy, R. Fu, A. Ramdas, J. Desai, R. Margolis, “U.S. Solar Photovoltaic System and Energy Storage Cost Benchmark (Q1 2020)” (NREL/TP-6A20-77324, 2021).
5. Internal Revenue Service, Individual Income Tax ZIP Code Data, SOI Tax Stats. https://www.irs.gov/statistics/soi-tax-stats-individual-income-tax-statistics-zip-code-data-soi.
6. United States Census Bureau, Data Profiles, Census.gov. https://www.census.gov/programs-surveys/acs/.
7. K. Gillingham, K. Palmer, Bridging the Energy Efficiency Gap: Policy Insights from Economic Theory and Empirical Evidence. Review of Environmental Economics and Policy 8, 18–38 (2014).
8. P. Herd, D. P. Moynihan, Administrative Burden: Policymaking by Other Means (Russell Sage Foundation, 2019).
9. A. Masood, M. Azfar Nisar, Administrative Capital and Citizens’ Responses to Administrative Burden. Journal of Public Administration Research and Theory 31, 56–72 (2021).
10. M. Lawson, “Changing Migration Patterns: A review of popular press and scholarly analysis” (Headwaters Economics, 2014).
11. R. Chetty, A. Looney, K. Kroft, Salience and Taxation: Theory and Evidence. American Economic Review 99, 1145–1177 (2009).
Posted by buchanle on Tuesday, June 2, 2026 in May 2026.
Tags: educational attainment, incentives, random forest regressor, residential energy tax credit, temporal analysis
