Controllable AI-Driven Chord Generation Using Transformer Language Models

ABSTRACT

Harmonic progressions are an essential part of music composition. Creating progressions that feel coherent and stylistically consistent is difficult for both musicians and machine learning systems. A coherent progression is one where each chord follows logically from the last, stays grounded in the key, and creates a sense of direction and resolution. This study presents a controllable transformer-based model that generates symbolic chord progressions using a custom musical token system. The system can read a user MIDI file, extract key information, and convert the musical material into Roman numeral representations that describe functional harmony rather than raw pitch data. Additional conditioning tokens act as prompts that describe the desired musical context, including key, number of measures, groove, energy level, and harmonic complexity. These prompts guide the transformer during generation, allowing users to influence the structure and character of the output. A custom transformer model was trained on a processed dataset of harmonic examples, and a decoding pipeline converts the generated Roman numeral progressions back into MIDI for playback. The model produces harmonic sequences that remain consistent with the intended key and respond in clear ways to changes in user prompts. This work shows that transformer language models can generate coherent and understandable harmonic material when paired with structured symbolic representations and controllable prompts drawn from user input.

INTRODUCTION.

Music composition relies on the creation of chord progressions that support the melody and shape the emotional quality of a piece. Although many listeners recognize certain progressions as familiar or pleasing, creating progressions that remain coherent and stylistically appropriate requires a strong understanding of functional harmony. Functional harmony describes the way chords relate to the key and to one another, and it forms the basis of most Western musical structures. Learning these patterns can take many years, and even experienced musicians often explore progressions through repeated trial and error. The challenge runs deeper than selecting from a finite set of notes. Chords must create tension, resolve it, and stay grounded in a key across an entire progression. A single misstep makes the music feel random or unfinished, even when every individual note is technically correct.

Recent advances in artificial intelligence have made it possible for machine learning models to learn complex patterns in sequential data. The introduction of transformer-based models demonstrated that attention mechanisms can capture long range structure in a wide variety of sequences [1]. Later work showed that these models could perform many tasks using the same underlying architecture, which encouraged researchers to apply them to music [2]. In symbolic music generation, models such as the Music Transformer were able to produce sequences with longer and more coherent structure [3]. However, many existing systems generate music directly from raw pitch information, representing notes as absolute MIDI values without any reference to key or harmonic function, which makes the output difficult to interpret and difficult to control. Musicians often need systems that can respond to specific musical instructions, remain stable in a desired key, and produce output that can be edited or understood within traditional music theory frameworks.

Symbolic representations of harmony, such as Roman numerals, offer a clearer way to describe how a chord functions within a key. These representations have been used in music theory for many decades and allow harmonic patterns to be expressed in a general form that is independent of absolute pitch. Research tools such as music21 have shown that symbolic representations can be extracted from MIDI data and analyzed computationally [4,5]. This makes it possible to design artificial intelligence systems that learn harmonic behavior rather than just sequences of pitches.

The goal of this study is to create a controllable system that generates chord progressions in response to user input. The system reads a user MIDI file, identifies the key, and converts the harmonic content into Roman numerals. Additional tokens that describe the desired number of measures, groove, energy level, and harmonic complexity act as prompts that guide the generation process. A transformer model is then trained on a dataset of symbolic chord progressions and used to generate new progressions that follow the requested musical conditions.

The hypothesis of this work is that a transformer model trained on symbolic harmonic representations can generate coherent and key consistent chord progressions that respond in predictable ways to user prompts. This study investigates whether symbolic encoding and prompt-based conditioning can make artificial intelligence systems more interpretable and more useful for musicians.

MATERIALS AND METHODS.

Dataset and Data Processing.

The model was trained using the SHLD Free MIDI Chord Packs, an open-source collection of more than ten thousand MIDI files created by Ludovic Drolez and distributed under the MIT License [6]. The dataset contains one hundred sixty-five chord progressions for each major and minor key, along with an additional set of modal progressions. These progressions are grouped by key and organized into directories that include triads, seventh chords, ninth chords, and extended modal progressions (Figure S1), each with several rhythmic variations. Because the dataset provides functional progressions labeled by key, it serves as a consistent foundation for symbolic harmonic learning.

Each MIDI file was processed using the pretty_midi library [5] to extract pitch classes (the twelve distinct note names in an octave, independent of which octave they appear in), along with onset times, durations, and other performance descriptors. Key labels were obtained directly from the directory structure of the dataset, which aligns each progression to its tonic. The tonic is the central note that gives the key its name and serves as the primary point of resolution in the progression. All chords were then converted into Roman numerals using the music21 toolkit [4]. The scale degree of each chord describes its position relative to the tonic within the key. This was computed using

\[d=\left(p_{root}-p_{key}\right)\bmod 12\tag{1}\]

where \(p_{root}\) is the pitch class of the chord root and \(p_{key}\) is the pitch class of the tonic. The modulo 12 operation reflects the fact that an octave contains twelve distinct pitch classes, so the result is always a semitone offset between 0 and 11. The resulting degree \(d\) was mapped to the corresponding Roman numeral, producing a functional harmonic representation independent of absolute pitch. Each symbolic progression was then stored as a sequence of tokens together with a measure level timing structure, and saved in a line-based JSON format for training. A schematic overview of the full processing pipeline, from MIDI input through symbolic encoding and generation, is shown in Figure S2 of the Supporting Information.
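
As a minimal sketch of Equation 1, the following Python computes the semitone offset between a chord root and the tonic and maps it to a Roman numeral. The lookup table covers only diatonic triads in a major key, and the function names and the `?d` placeholder for out-of-key roots are assumptions made for illustration; the pipeline described above relied on music21 for this step.

```python
# Illustrative sketch only: names and the lookup table are assumptions.
PITCH_CLASSES = {"C": 0, "C#": 1, "D": 2, "D#": 3, "E": 4, "F": 5,
                 "F#": 6, "G": 7, "G#": 8, "A": 9, "A#": 10, "B": 11}

# Semitone offset from the tonic -> Roman numeral (diatonic triads, major key).
MAJOR_DEGREE_TO_NUMERAL = {0: "I", 2: "ii", 4: "iii", 5: "IV",
                           7: "V", 9: "vi", 11: "vii°"}


def scale_degree(root_pc: int, key_pc: int) -> int:
    """Equation 1: d = (p_root - p_key) mod 12."""
    return (root_pc - key_pc) % 12


def to_roman(root: str, key: str) -> str:
    """Map a chord root to a Roman numeral relative to a major-key tonic."""
    d = scale_degree(PITCH_CLASSES[root], PITCH_CLASSES[key])
    # Roots outside the major scale get a placeholder in this sketch.
    return MAJOR_DEGREE_TO_NUMERAL.get(d, f"?{d}")


progression = [to_roman(root, "D") for root in ["D", "B", "G", "A"]]
print(progression)  # ['I', 'vi', 'IV', 'V']
```

Because only the offset from the tonic is kept, the same four numerals would result from C, A, F, G in the key of C, which is what makes the representation independent of absolute pitch.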

Model Architecture and Training.

A transformer-based language model was used to learn the symbolic harmonic sequences described above. The model followed the transformer architecture introduced by Vaswani and colleagues [1] and used an autoregressive objective like that described by Radford and colleagues [2]. The architecture consisted of eight transformer layers, each containing multihead self-attention and a feed forward projection. The embedding dimension was set to 384 with six attention heads. Each token in the vocabulary, including Roman numerals, modal symbols, measure markers, and other special symbols, was assigned a learned embedding vector.
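
The stated hyperparameters imply a per-head dimension and a rough size for the transformer blocks, which the short calculation below works through. The feed-forward width of four times the embedding dimension is an assumption (the text does not state it), and the estimate ignores biases, layer normalization, and the embedding table, whose size depends on a vocabulary size that is not reported.

```python
D_MODEL = 384   # embedding dimension (from the text)
N_HEADS = 6     # attention heads (from the text)
N_LAYERS = 8    # transformer layers (from the text)
FF_MULT = 4     # ASSUMPTION: feed-forward width = 4 * D_MODEL

# Each head attends over an equal slice of the embedding.
head_dim = D_MODEL // N_HEADS


def approx_layer_params(d_model: int, ff_mult: int) -> int:
    """Weight count for one block, ignoring biases and layer norms."""
    attention = 4 * d_model * d_model                  # Q, K, V, output projections
    feed_forward = 2 * d_model * (ff_mult * d_model)   # up- and down-projection
    return attention + feed_forward


total_block_params = N_LAYERS * approx_layer_params(D_MODEL, FF_MULT)
print(head_dim, total_block_params)  # 64 14155776
```

Under these assumptions the blocks hold roughly 14 million weights, a size that is plausible to train on a single consumer graphics card.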

The model was trained to predict the next token in the sequence given all previous tokens. The loss function used for optimization was the standard cross entropy objective

\[L=-\sum_{t=1}^{T}\log P\left(x_t\middle| x_1,\ldots,x_{t-1}\right)\tag{2}\]

where \(x_t\) is the ground-truth token at time step \(t\), \(T\) is the sequence length, and \(P(x_t \mid x_1,\ldots,x_{t-1})\) is the probability the model assigns to that token given all preceding tokens. Training was performed using the PyTorch framework [8] together with the Hugging Face Transformers library [9]. The dataset was divided into training and validation sets with an eighty to twenty ratio. The model was trained for multiple epochs on an NVIDIA GeForce 1660 graphics processor using mini batches of symbolic sequences.
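
To make the cross-entropy objective above concrete, the sketch below evaluates it for a single short sequence in plain Python. The per-step probabilities are made-up values standing in for what a model might assign to each ground-truth token; in the actual training setup this computation would be handled by the deep learning framework.

```python
import math


def sequence_nll(step_probs):
    """Sum of -log P(x_t | x_1..x_{t-1}) over one sequence.

    `step_probs` holds the probability the model assigned to the
    ground-truth token at each time step.
    """
    return -sum(math.log(p) for p in step_probs)


# Hypothetical probabilities for a three-token sequence.
probs = [0.5, 0.25, 0.5]
loss = sequence_nll(probs)
print(round(loss, 4))  # 2.7726, i.e. log(16)
```

A model that assigned probability 1 to every correct token would reach a loss of zero, and each less confident prediction adds a positive penalty.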

User Prompt Conditioning and Inference.

To generate a chord progression for a user supplied MIDI file, the system first processed the MIDI using the pretty_midi and music21 pipelines described above. The file was analyzed to identify the key, extract any existing harmonic material, and compute pitch class distributions. A prompt was created that included the detected key token, the number of measures requested by the user, the groove type, the harmonic energy, and the harmonic complexity. Groove describes the rhythmic feel and pattern density of the progression. Harmonic energy reflects the rhythmic activity and brightness of the chords. Harmonic complexity describes the density of chord extensions and voice movement. These conditioning tokens describe the intended musical context and are placed at the start of the input sequence.
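
One way to picture the conditioning prompt is as a short list of tokens prepended to the sequence, as sketched below. The token spellings (`<KEY=...>`, `<ENERGY=...>`, and so on), the `<BOS>` marker, and the example values are hypothetical; the text specifies which attributes are encoded but not the exact vocabulary.

```python
def build_prompt(key, measures, groove, energy, complexity):
    """Assemble conditioning tokens in a fixed order.

    Token spellings are illustrative assumptions, not the system's vocabulary.
    """
    return [
        "<BOS>",
        f"<KEY={key}>",
        f"<MEASURES={measures}>",
        f"<GROOVE={groove}>",
        f"<ENERGY={energy}>",
        f"<COMPLEXITY={complexity}>",
    ]


prompt = build_prompt("D_major", 8, "straight", "high", "medium")
print(prompt)
```

Keeping the tokens in a fixed order at the start of the sequence means every generated chord token can attend to all of the conditions.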

The trained transformer model then generated a symbolic harmonic progression one token at a time. Because the model is autoregressive, each generated token was appended to the input for predicting the next token. The generation process continued until the model produced an end of sequence marker, which indicated that the progression was complete.
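
The generation loop described above can be sketched as follows. The trained model is replaced by a toy stand-in that returns a canned progression, and the `max_new_tokens` cap is an added safeguard that the text does not mention; both are assumptions made so the example runs on its own.

```python
from typing import Callable, List

EOS = "<EOS>"  # hypothetical spelling of the end-of-sequence marker


def generate(prompt: List[str],
             next_token: Callable[[List[str]], str],
             max_new_tokens: int = 64) -> List[str]:
    """Autoregressive decoding: append one token at a time until EOS."""
    sequence = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(sequence)   # prediction conditioned on everything so far
        sequence.append(token)
        if token == EOS:
            break
    return sequence


def make_toy_model(prompt_len: int) -> Callable[[List[str]], str]:
    """Stand-in for the trained transformer: emits a fixed progression."""
    canned = ["I", "vi", "IV", "V", EOS]

    def next_token(sequence: List[str]) -> str:
        produced = len(sequence) - prompt_len
        return canned[min(produced, len(canned) - 1)]

    return next_token


prompt = ["<KEY=D_major>", "<ENERGY=high>"]
output = generate(prompt, make_toy_model(len(prompt)))
print(output[len(prompt):])  # ['I', 'vi', 'IV', 'V', '<EOS>']
```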

Symbolic Decoding and MIDI Rendering.

After generation, the symbolic chord progression needed to be converted back into a MIDI file. Each Roman numeral was reconstructed into a chord by mapping the scale degree back to an absolute pitch within the identified key. The music21 toolkit was used to determine chord tones based on functional harmony, and a consistent voicing scheme was applied to ensure uniform sound across examples. The reconstructed chords were then written into a MIDI file using pretty_midi, with note durations and timing determined by the number of measures specified by the user.
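
The decoding step can be illustrated with a small pure-Python sketch that turns diatonic Roman numerals into MIDI note numbers. It assumes a major key and root-position close-voiced triads; the actual system used music21 for chord tones and an unspecified voicing scheme, so the tables and function name here are illustrative only.

```python
# Semitone offsets of the major scale above the tonic.
MAJOR_SCALE = [0, 2, 4, 5, 7, 9, 11]

# Diatonic numeral -> index of the chord root within the scale.
NUMERAL_TO_INDEX = {"I": 0, "ii": 1, "iii": 2, "IV": 3,
                    "V": 4, "vi": 5, "vii°": 6}


def triad_pitches(numeral: str, tonic_midi: int) -> list:
    """Build a root-position triad by stacking scale steps (root, third, fifth)."""
    root_index = NUMERAL_TO_INDEX[numeral]
    pitches = []
    for step in (0, 2, 4):
        # Wrap past the seventh degree into the next octave.
        octave, degree = divmod(root_index + step, 7)
        pitches.append(tonic_midi + 12 * octave + MAJOR_SCALE[degree])
    return pitches


D4 = 62  # MIDI note number for D above middle C
chords = [triad_pitches(n, D4) for n in ["I", "vi", "IV", "V"]]
print(chords)  # [[62, 66, 69], [71, 74, 78], [67, 71, 74], [69, 73, 76]]
```

Stacking scale steps rather than fixed intervals yields the correct chord quality for each degree automatically, for example a minor triad on vi and a major triad on V.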

Evaluation Procedure.

Generated progressions were evaluated using both quantitative and qualitative measures. Key consistency was measured by calculating the proportion of chords that belonged to the diatonic set of the detected key. Structural coherence was assessed by examining the distribution of chord functions across the progression and comparing them to typical harmonic patterns. Additional evaluation was performed by musicians who rated whether the generated progressions reflected the requested groove type, harmonic energy, and harmonic complexity. These evaluations were used to determine the extent to which symbolic conditioning produced musically meaningful and predictable results.
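
The key consistency measure can be expressed in a few lines. The sketch below assumes a major key and plain triad numerals; symbols carrying extensions (such as a seventh) would need to be normalized before the membership test, and the example progression is invented for illustration.

```python
# Diatonic triads of a major key (illustrative; minor keys would need their own set).
DIATONIC_MAJOR = {"I", "ii", "iii", "IV", "V", "vi", "vii°"}


def key_consistency(numerals, diatonic=DIATONIC_MAJOR):
    """Proportion of chords in a progression that belong to the diatonic set."""
    if not numerals:
        raise ValueError("progression must contain at least one chord")
    in_key = sum(1 for n in numerals if n in diatonic)
    return in_key / len(numerals)


# Hypothetical progression with one borrowed chord (bVII).
score = key_consistency(["I", "vi", "IV", "bVII", "V"])
print(score)  # 0.8
```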

RESULTS.

The system was evaluated by generating harmonic continuations for user supplied MIDI files and analyzing both the musical structure and the quantitative behavior of the transformer model. The goal of these evaluations was to determine whether the model produced progressions that were coherent, key consistent, and responsive to user defined conditioning tokens. The following results illustrate both qualitative and quantitative aspects of the system’s performance.

As shown in Figure 1, low energy progressions feature long sustained notes while high energy progressions feature short rapid attacks. Low harmonic complexity produces sparse voicings with fewer chord tones, while high complexity produces denser layered harmonies.

**Figure 1.** The two conditioning parameters used to guide generation. Energy level controls rhythmic activity and harmonic complexity controls chord density.

To evaluate the system qualitatively, we compared the user supplied input to what the model generated. Figure 2 shows the original MIDI file uploaded by the user, along with the text prompt describing a high energy, moderately complex chord progression in D major. Figure 3 shows what the model produced in response. Looking at Figure 3, the output has two clear layers — the bottom half shows long, sustained chords that form the harmonic backbone of the progression, while the top half shows faster, more active notes moving in an arpeggiated pattern. Together these reflect exactly what the prompt asked for: the chords provide harmonic stability while the rhythmic activity in the upper register creates the bright, energetic character. The chord tones are also vertically aligned across measures, showing that the model produced structured triadic and seventh chord voicings rather than random pitches.

**Figure 2.** User supplied MIDI input shown as a piano roll, together with the prompt specifying key, scale, energy level, and harmonic complexity.

**Figure 3.** AI generated harmonic continuation produced by the transformer-based model in response to the prompt shown in Figure 2, displayed on a piano roll in D major.

To assess the tonal stability of the system, the proportion of chords belonging to the diatonic set of the detected key was computed for each generated progression. If a progression remained strictly within the key of D major, for example, the maximum value of 1 would be obtained. In practice, the model occasionally introduced nondiatonic chords, but most progressions achieved values greater than 0.8, as illustrated in Figure 4.

**Figure 4.** Key consistency across ten generated chord progressions. Each bar represents the percentage of chords in a progression that belong to the diatonic set of the detected key, demonstrating that most samples maintain more than eighty percent diatonic content.

This suggests that the model learned the dominant harmonic tendencies of the dataset and was able to preserve key structure when generating new material. The consistency of these values across samples indicates that key stability was not an isolated outcome but a general property of the model.

Beyond tonal stability, the effect of the conditioning tokens was evaluated using four controlled generations that varied only in energy level and harmonic complexity. The low energy output (Figure S3) produced long sustained chords with minimal upper voice movement, resulting in a calm and lightly textured progression. In contrast, the high energy continuation (Figure S4) exhibited short, percussive chord strikes and increased rhythmic density, creating a brighter and more dynamic harmonic surface. A similar contrast was observed in the complexity conditioned generations. The low complexity sample (Figure S5) contained primarily triads with limited added tones, while the high complexity output (Figure S6) featured dense voicings, chord extensions, and richer harmonic motion. These controlled examples demonstrate that the model responds predictably to symbolic conditioning tokens and that changes in energy and harmonic complexity produce musically meaningful differences in the generated progressions.

DISCUSSION.

The results of this study support the hypothesis that a transformer model trained on symbolic harmonic representations can generate coherent, key consistent chord progressions that respond in predictable ways to user prompts. The model not only maintained tonal stability throughout its generations but also demonstrated an ability to produce harmonic motion that aligns with established functional patterns. This behavior suggests that training on symbolic Roman numeral sequences allows the model to internalize general harmonic relationships rather than memorizing isolated examples from the dataset.

The strong key consistency observed in the generated progressions provides direct evidence that the model learned the tonal structure present in the training data. Most progressions contained more than eighty percent diatonic chords, indicating that the model could reliably preserve the underlying key even as it introduced nondiatonic tones for expressive effect. This supports the hypothesis that symbolic encoding helps create an interpretable system whose decisions can be connected to familiar concepts in music theory.

Models such as the Music Transformer can generate long, expressive piano performances directly from raw pitch data [3]. That capability comes at a cost: the output is hard to interpret and harder to control. A musician cannot easily instruct the model to produce something calmer, more complex, or rooted in a specific key. This system takes a different approach. Because it works with symbolic Roman numeral representations instead of raw pitch, the decisions the model makes can be connected back to familiar music theory concepts, and a request such as a high energy progression in D major with dense chord voicings maps directly onto conditioning tokens. The trade-off is scope. This system generates chord progressions, not full performances, so there is no melody, expressive timing, or dynamics. That restriction is deliberate: the system is intended as a tool for musicians to build on, not a replacement for human creativity. The Music Transformer and this system address different problems, and used together they could be complementary.

The conditioning experiments further demonstrate that the model responds in predictable and musically meaningful ways to user prompts. The contrast between low and high energy generations shows that the model associates higher energy with greater rhythmic activity and shorter chord durations, while lower energy prompts produce smoother textures with longer sustained harmonies. Similarly, the differences between low and high complexity outputs indicate that the model understands harmonic richness as the presence of chord extensions, expanded voicing, and increased voice movement. These structured and consistent behaviors across Figures S3 through S6 show that prompt-based conditioning allows users to shape the expressive character of the generated progression with a clear and interpretable effect.

Together, these findings indicate that symbolic representations and conditioning tokens create an effective framework for controlling harmonic generation. The model’s predictable responses to energy and complexity prompts, as well as its stable tonal behavior, demonstrate that transformer-based systems can be interpretable and useful tools for musicians seeking harmonically coherent and customizable chord progressions.

CONCLUSION.

This study demonstrated that a transformer model trained on symbolic harmonic data can generate coherent and key consistent chord progressions that respond in predictable ways to user defined prompts. By combining Roman numeral encoding with conditioning tokens, the system produced harmonically stable progressions while allowing users to influence expressive qualities such as energy level and harmonic complexity. These findings show that symbolic representations can make artificial intelligence systems more interpretable and musically useful.

Several limitations of the present work suggest opportunities for improvement. The system generates progressions offline rather than in real time, which limits its use in live performance or interactive composition. The conditioning space is currently restricted to symbolic tokens and does not capture subtler musical attributes such as articulation, groove, or long-range phrasing. In addition, while the chord pack dataset is extensive, it does not represent the full harmonic diversity found in modern genres, which may limit generalization.

Future work may extend this framework to additional musical layers, such as basslines, melodies, or drum patterns, enabling the system to generate multi-voice textures that form more complete musical ideas. Improving computational efficiency to support real-time generation would further expand the system’s creative potential. Additional evaluation through listener studies or genre-specific datasets could also deepen understanding of how users perceive the model’s outputs.

Overall, this study provides evidence that symbolic encoding and prompt-based conditioning offer a promising path toward controllable and interpretable AI-assisted music generation. As these systems continue to evolve, they may become powerful tools for musicians, composers, and producers seeking new ways to explore harmonic structure and creative expression.

The use of AI in music composition also raises ethical and legal considerations. Recent legal rulings have determined that works produced entirely by AI may not qualify for copyright protection, creating uncertainty around ownership when AI tools are used in creative practice. There is ongoing debate about whether AI assistance affects the authenticity of musical work. These concerns do not have definitive answers, but awareness of them is useful for composers choosing to use these systems.

ACKNOWLEDGMENTS.

I would like to thank Justin DeLay and Nitin Jayakumar for their support throughout this project. Justin inspired much of the direction of this research and encouraged me to explore symbolic approaches to harmonic modeling. Nitin, as a researcher himself, provided valuable guidance and offered to review and refine the work in any way he could. Their contributions and mentorship were greatly appreciated.

SUPPORTING INFORMATION.

Supporting Information includes Figures S1 through S6, showing the chord types used in the dataset, the full system pipeline, and conditioning-based generation examples for low and high energy levels and low and high harmonic complexity.

REFERENCES

1. A. Vaswani, et al., Attention is all you need. Adv. Neural Inf. Process. Syst. 30, 5998–6008 (2017).
2. A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, “Language models are unsupervised multitask learners” (OpenAI, San Francisco, CA, 2019).
3. C.-Z. A. Huang, et al., Music transformer: Generating music with long-term structure. Int. Conf. Learn. Represent. (2019); https://openreview.net/forum?id=rJe4ShAcF7.
4. M. S. Cuthbert, C. Ariza, music21: A toolkit for computer-aided musicology and symbolic music data. Proc. 11th Int. Soc. Music Inf. Retrieval Conf., 637–642 (Utrecht, Netherlands, 2010).
5. C. Raffel, D. P. W. Ellis, Intuitive analysis, creation, and manipulation of MIDI data with pretty_midi. Late-Breaking Demo, 15th Int. Soc. Music Inf. Retrieval Conf. (Taipei, Taiwan, 2014), paper LBD29; https://ismir2014.ismir.net/LBD/LBD29.pdf.
6. L. Drolez, SHLD Free MIDI Chord Packs (GitHub, 2023); https://github.com/ldrolez/free-midi-chords.
7. Zakarii, Lofi hip hop MIDI (Kaggle, 2023); https://www.kaggle.com/datasets/zakarii/lofi-hip-hop-midi.
8. A. Paszke, et al., PyTorch: An imperative style, high performance deep learning library. Adv. Neural Inf. Process. Syst. 32, 8024–8035 (2019).
9. T. Wolf, et al., Transformers: State-of-the-art natural language processing. Proc. 2020 Conf. Empirical Methods Nat. Lang. Process. (System Demonstrations), 38–45 (Association for Computational Linguistics, Online, 2020).

Posted by buchanle on Tuesday, May 19, 2026 in May 2026.

Tags: Creative AI, Machine learning, MIDI Analysis, Music Generation, Transformer Model