Resources on fine-tuning the model & local execution #12
@mightimatti Hi! Thank you for writing and for appreciating my work. It means a lot to me :)

Unfortunately, there are no local (CLI Python) training and MIDI processing scripts just yet. However, there are Colab notebooks that will let you process MIDIs if you want to fine-tune. https://github.com/asigalov61/Allegro-Music-Transformer/tree/main/Training-Data In that dir you will find the MIDI processing Colab and an auto-generated Python script that you can use to process files locally.

However, Allegro does not have a stand-alone fine-tuning script, which is why I wanted to invite you to check out my Giant Music Transformer model/implementation. It is much better suited for fine-tuning and it has a stand-alone fine-tune Colab/code ready to be used. https://github.com/asigalov61/Giant-Music-Transformer https://github.com/asigalov61/Giant-Music-Transformer/tree/main/Fine-Tune

In regard to generating without seeds, you can find such generator code for both Allegro and Giant in the Original Version of the generator Colab under Improv Generation. Giant Music Transformer also has bulk improvs and continuations generation which you can use to generate from scratch.

Hope this answers your questions, and I hope it is not too confusing since there is a lot :) But if you need more help, especially with fine-tuning, feel free to ask :)

Alex
Hi Alex, I apologize for the many questions, but I am happy to share my code once I successfully train my model, as this might be useful to others who wish to train it. The reason I would like to train a local instance is to allow me to train on a cloud instance from Vast.ai.
@mightimatti No problem, I am happy to help :)

While somewhat similar, the Allegro data format is still a bit different from the Giant one, so you have to use the Allegro MIDI processor if you want your data to be compatible with Allegro. The Allegro MIDI processor code is located here, but it will need some clean-up since it was auto-generated by Google Colab. So use this to process your MIDIs and then use the Allegro training code to train or fine-tune.

Other than that, I would be happy to add your local training code to the repo if you make a PR once you are done :)

I really apologize for not having proper local versions of the MIDI processor code and the training code. The reason is that it is more practical and convenient to have them as Colabs, since many people do not have local GPUs for music AI.

I also wanted to suggest you check out Lambda Labs (https://lambdalabs.com/) for training and inference. It has very good prices, and it is fully compatible with Jupyter/Google Colab notebooks, which makes it easy to do all this stuff.

Hope this helps.

Alex
Thank you again. I'll let you know how it progresses :-) Is there a specific reason why you prefer git cloning TMIDIX as opposed to just installing it via pip? When I last tried using Lambda, about a year ago, I couldn't get it to recognize my European credit card, so I never got to use it. It does look good, though.
@mightimatti Yes, those auto-generated scripts need some clean-up. I clone TMIDIX/tegridy-tools because it is the most up-to-date version. That is correct: I am not very handy with PyPI/pip packages, so I never really get around to updating the package, which is why git cloning is the best way to get the latest version. I see about Lambda. Well, anyway, let me know if you need more assistance; I am here to help. And feel free to PR the local implementation if/when you finish it, because I think others may find it useful as well :) Alex
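As a rough sketch of that git-clone workflow (the paths here are illustrative and may need adjusting to your local layout):

```python
# Clone the repo once in a shell:
#   git clone --depth 1 https://github.com/asigalov61/tegridy-tools
import sys

# TMIDIX.py lives inside the nested tegridy-tools/tegridy-tools directory of the checkout;
# adjust this path to wherever you cloned the repo.
sys.path.append('./tegridy-tools/tegridy-tools')

import TMIDIX  # latest version straight from the repo, no pip package involved
```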
Hi Alex,

I was able to get things to run locally, and I am having mixed results with the model so far. I'll give you a little background on what I am trying to do, as I have hit a roadblock and need a little guidance.

We switched over to the Giant Music Transformer, as this is what you recommend. My homework group and I have spent a substantial amount of time repurposing the Giant Music Transformer for the specific task of generating 4-voice choral compositions as part of an experimental AI course at university. We are trying to generate 4-channel MIDI compositions reminiscent of choir pieces. I succeeded in preparing approx. 8K MIDI files which contain 4 distinct channels and am now trying to modify the dataset maker to pitch the files to C major (or A minor in the case of a minor mode), remap the channels to channels 0-3, and finally run the preprocessor as usual.

What I have come to realize is that the Giant Music Transformer, even in its basic (non-XL) form, is way larger (more parameters) than what I suspect I need, as all my music is in the same key and only 4 monophonic channels are used, as opposed to 16 arbitrary (polyphonic) ones. I have had somewhat reasonable results training overnight with ~1M parameters on Johann Sebastian Bach's collected choral works (albeit with all instruments mapped to a single channel) and would now like to retrain, this time with 4 channels. I am modifying the preprocessor to this end and came across the following questions I can't seem to clarify:

Where does the constant for the model parameter num_tokens come from, what can I do to reduce the dimensionality of the embedding space, and how can I calculate the required size? Is there a chance you could give me a brief overview of the steps that the preprocessor goes through and briefly describe the representation of the training instances? Maybe a higher-level understanding will allow me to understand the rest on my own.

Btw, I haven't forgotten about sharing my local execution code so others can use it, but I am waiting to finish my project and share the (polished) results. Sorry for the many questions and thank you for your availability so far.
@mightimatti You are welcome :) Thank you for appreciating my work, and please know that I am always happy to help :)

To answer your question... I am sorta old-school when it comes to coding, so my code can be hard to read, for which I apologize... But basically, num_tokens was calculated by adding all encoding tokens together + pad_idx + 1. So the number that was hard-coded in the original processor colab is the exact number of tokens needed for all features of the implementation.

To give you a specific, exact breakdown:

512 (delta_start_time) + 4096 (duration_velocity) + 16641 (patch_pitch) + 1024 (bar_counter) + 512 (bar_time) + 128 (bar_pitch) + 1 (outro_token) + 2 (drums_present) + 129 (intro_seq_patch) + 1 (intro/SOS) + 1 (EOS) + 1 (pad_idx) == 23047

This is for the original/full-featured version. Pre-trained models were made with a simplified version of the same so that they can fit into consumer GPUs at a reasonable number of batches. I hope this clarifies how num_tokens was calculated.

And yes, for your purposes you can probably make a smaller/stripped-down version of the GMT, since you only have 8K MIDIs and only a 4-channel/short-generation requirement. However, you should try fine-tuning too, since it may produce good results as well. I ran a few fine-tuning experiments with GMT, and fine-tuned versions play really well if the generation seq_len is not too long.

Anyway, let me know if you need more help or if you need me to elaborate more on the GMT MIDI processor.

Alex.
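As a rough, illustrative sketch (not the exact GMT code), the breakdown above can be written out so it is easy to recompute when features are stripped out; the dictionary keys, the inferred factorizations in the comments, and the pad_idx convention are assumptions, so verify the resulting constant against the processor colab:

```python
# Per-feature token ranges, taken from the breakdown above (illustrative names).
FEATURE_RANGES = {
    "delta_start_time": 512,
    "duration_velocity": 4096,   # presumably durations x velocity buckets
    "patch_pitch": 16641,        # presumably 129 patches x 129 pitches
    "bar_counter": 1024,
    "bar_time": 512,
    "bar_pitch": 128,
    "outro_token": 1,
    "drums_present": 2,
    "intro_seq_patch": 129,
    "intro_sos": 1,
    "eos": 1,
}

# Assumed convention: the pad token sits one past all feature tokens,
# and the model vocabulary size is pad_idx + 1.
pad_idx = sum(FEATURE_RANGES.values())
num_tokens = pad_idx + 1

# Dropping features (e.g. the bar tokens) shrinks the embedding accordingly:
num_tokens_no_bars = num_tokens - sum(FEATURE_RANGES[k] for k in ("bar_counter", "bar_time", "bar_pitch"))
```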
Hi Alex,

I ran training all night with my new dataset and the sequence length set to 4K, and came to realize that my 900K-parameter model is probably underfitting at this sequence length. ATM I am not able to reduce the dimensionality of the embedding space, because I don't understand the parameters you mention and how I would go about reducing them. I don't understand how you determined, as an example, that the
I'm sorry if this seems obvious to you, but I really don't understand the intermediate representation/embedding and I believe this is key to understanding both the model and the synthesis of audio.
Sorry for the many questions. FWIW, I will try to write my code in such a way that future users can benefit from my more generalized preprocessing script!
@mightimatti Yes, I would be happy to help you here :) You are definitely going in the right direction with all this :)

First of all, you need to properly prep/process your MIDIs. From my experience, it is important to use at least a 1024 embed (this is the minimum) and also an appropriate seq_len which is close to the average length of your MIDIs when encoded the way you prefer. 4096 is a good length if you use triplet or quad encoding... I will explain and show below... So I will begin by answering your questions...
Yes, TMIDIX is the best way to go about it because it has everything you may need for your project, and I finally streamlined it in my latest update last month, so definitely try it out.

To help you out further, I wanted to invite you to use the shared Google Colab below so that I can show you how you can do your project and also help you get familiar with TMIDIX functions/approaches. https://colab.research.google.com/drive/1XALSVLcnCqYvPiyv6ZWl99Q0a4gGIh3o?usp=sharing This colab uses my POP Melody Transformer implementation as the core code, but we will modify it for your needs and specs shortly, so do not worry if it says POP Melody Transformer.

So what I need from you to help you effectively is a few example MIDIs from your dataset. I hope you can share them here as a zip attachment so that I can take a look and adjust the colab code appropriately for you.

Alex.
Hi Alex,

I have read your answer multiple times and come to realize that my misunderstanding of MIDI led me to use channels the way I should have used patches... I used them to encode different tracks/instruments. I think I understood quite a bit more about the GMT by going through the code for the past hour with the answers you gave me.

While the next step for me is making modifications to incorporate these insights into my version of the dataset maker, it is also complicated by the code being a bit of a mess, as I believe it is an automatic export from Colab, which I have already adapted to run locally on my computer. I very much appreciate your offer to have a look at the code, but it is difficult for me to use the Colab you shared, as it seems to differ quite a bit from the GMT's dataset maker, at least the version I am using, and I have already made significant modifications to the preprocessor to incorporate the change of key and to be able to run the script locally (the original intention of this thread). Maybe I can ask you two more questions which would greatly help me:
The reason I ask is not because I am desperately trying to reduce the dimensionality, but because I feel I might not be following. Is this related to the extra patch for the drums? Or is there another reason why I need an extra 128 pitches/entries?
Thank you very much. I apologize in case my questions aren't clear. It is very late over here in Berlin :-)
@mightimatti Np, whatever works better for you :) I just thought that showing a simple Colab example might make things easier to understand, but if not, then let me just answer your questions the best I can... :)
Now, intro and outro are simply aux tokens (seqs) to soft-condition the model for intros and outros. So for example, I use outro tokens to generate the outro (ending) of a composition when I use the model to do a continuation, since the model may not always generate the ending by itself. Same with intro... I use it to soft-condition the model to generate some arbitrary beginning of the composition when I compose with the model.

What else do you want to know about the GMT dataset maker? Ask me specifically so that it is easier for me to explain it to you, since there are a lot of features in the dataset maker. Hope this answers your current questions, but feel free to ask more if needed :)

Alex.

PS. Check out the TMIDIX advanced_score_processor function. It is based on the GMT processor code and it is very handy for working with my code/implementations. Please note that you may need to pull the latest TMIDIX from tegridy-tools, as I believe the GMT copy does not have the latest updates...
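A rough usage sketch of that function, assuming the current tegridy-tools layout; the reader function name and keyword argument below follow the GMT-style colabs and may differ between TMIDIX versions, so treat them as assumptions and check TMIDIX.py:

```python
import sys
sys.path.append('./tegridy-tools/tegridy-tools')  # path to the cloned repo (assumption)

import TMIDIX

# Read one MIDI, convert it to a millisecond-timed score, then run the GMT-style processor.
midi_bytes = open('example.mid', 'rb').read()
raw_score = TMIDIX.midi2single_track_ms_score(midi_bytes)
processed = TMIDIX.advanced_score_processor(raw_score, return_enhanced_score_notes=True)
enhanced_notes = processed[0]  # first element: the enhanced score notes (exact structure depends on the TMIDIX version)
```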
Hi Alex,

My idea is that every time data is encoded/decoded, a dictionary is created which contains all the relevant factors and offsets. This can be stored with the dataset and ensures consistent encoding and decoding. It might even help you with transferring embeddings from one model's code base to another. This is what I came up with so far. (Please disregard the ...)

```python
from pprint import pformat


def get_parameters_with_defaults(configuration_parameters, DEBUG=False):
    """
    Get all relevant parameters for the parameter embedding,
    overwriting the default parameters with user-provided configuration values.
    """
    # Default parameters. User-passed params overwrite these.
    PARAMS = {
        # the sequence length the model was trained for
        "model_sequence_length": 1024,
        "entries_per_duration": 256,
        # a value that time/duration/velocity are divided by during dataset preprocessing
        # see https://github.com/asigalov61/Allegro-Music-Transformer/issues/12#issuecomment-1923632903
        # octo-velocity refers to this being 8...
        "temporal_downsampling_factor": 8,
        "patch_count": 4,
        "channel_count": 4,
        # number of different values to allow as valid pitches per patch
        "pitches_per_patch": 60,
        # if only a subset of permissible MIDI pitches is to be used, i.e. pitches_per_patch != 128,
        # offset index 0 with this value
        "pitch_shift": 12,
        # value that is added to the patch value
        "patch_mapping_offset": 53,
        # number of velocity values in a MIDI file
        "number_velocities": 8,
    }
    # update default values with user-provided values
    PARAMS.update(**configuration_parameters)

    # Derived values: these are calculated from the parameters above.
    # Number of values that a temporal quantity (time/duration) can take.
    entries_per_temporal_val = (
        PARAMS["model_sequence_length"] // PARAMS["temporal_downsampling_factor"]
    )
    # number of different values to allow as valid pitches across all patches
    pitch_entries = PARAMS["patch_count"] * PARAMS["pitches_per_patch"]
    duration_velocity_entries = entries_per_temporal_val * PARAMS["number_velocities"]
    max_pitch = PARAMS["pitches_per_patch"] + PARAMS["pitch_shift"]
    # Valid range of pitches to check against (currently not returned)
    valid_pitch = range(PARAMS["pitch_shift"], max_pitch)

    DERIVED_VALUES = {
        "entries_per_temporal_val": entries_per_temporal_val,
        # "entries_per_duration": entries_per_duration,
        "pitch_entries": pitch_entries,
        "duration_velocity_entries": duration_velocity_entries,
        "max_pitch": max_pitch,
    }

    # LIMITS: the index ranges delimiting the various properties encoded in the INTs.
    # The final value is the embedding dimension.
    LIMITS_TIME = range(0, PARAMS["entries_per_duration"])
    LIMITS_DURATION_VELOCITY = range(
        PARAMS["entries_per_duration"],
        PARAMS["entries_per_duration"] + duration_velocity_entries,
    )
    LIMITS_PITCH = range(
        PARAMS["entries_per_duration"] + duration_velocity_entries,
        PARAMS["entries_per_duration"] + duration_velocity_entries + pitch_entries,
    )
    LIMITS = (
        LIMITS_TIME,
        LIMITS_DURATION_VELOCITY,
        LIMITS_PITCH,
    )

    if DEBUG:
        print("####" * 12)
        print("Loaded Embedding parameters from config")
        print("####" * 12)
        print(f"User config: {pformat(configuration_parameters)}")
        print("####" * 12)
        print(f"Resulting config: {pformat(PARAMS)}")
        print("####" * 12)
        print(f"Derived values: {pformat(DERIVED_VALUES)}")
        print("####" * 12)
        string_limits = "\n".join(map(lambda x: f"[{x[0]}, {x[-1]}]", LIMITS))
        print(f"Embedding ranges: {string_limits}")
        print("####" * 12)

    return (PARAMS, DERIVED_VALUES, LIMITS)
```

In the process I encountered the following questions:
Once again, I apologize for asking all these questions. I hope you can see that I have invested a lot of time into understanding this, and my hope is that, since you seem interested in sharing your code, my difficulties understanding it might help you develop it in a more accessible format.
@mightimatti I apologize for the delayed response... Thank you for your suggestion about a parameters config. This is indeed how it should be done so that the code is easier to read and to use. I will definitely consider that for my future implementations and projects. To answer your questions:
In GMT, since I use ms timings, the MIDI is first converted from ticks to ms, and then I further downsample the timings range by dividing the values by some factor (usually 8 or 16), which produces a range of 512 or 256 values respectively for delta_start_times and durations. In other words, I downsample the original MIDI timings (start times and durations) so that they are in a reasonable range while preserving sufficient resolution to avoid damaging the music structure.

For example, in my production versions of GMT, I first converted timings from ticks to ms, then divided them by 16, which gave me a range of timing values from 0 to 255 (256 values) for delta_start_times and durations, which in turn is equivalent to 64 values per second of time (1000 ms), with the max time being 4 s (4000 ms). So if my delta_start_time is 1 second, the value to describe it would be 64. If the delta_start_time is 4 seconds, the value to describe it would be 255. Same for durations.

Hope this makes sense in regard to my timings encoding and conversion.
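A tiny sketch of that quantization step, assuming start times and durations are already in milliseconds and using the divide-by-16 / 256-value variant described above (the clamp value is an assumption to keep everything in range):

```python
def quantize_ms(value_ms: int, divisor: int = 16, max_token: int = 255) -> int:
    """Downsample a millisecond timing into a small integer token.

    With divisor=16 each token step is 16 ms (roughly 62-64 steps per second),
    and anything longer than max_token * divisor ms saturates at max_token.
    """
    return min(value_ms // divisor, max_token)


# Example: a 1-second gap and a 5-second note duration
print(quantize_ms(1000))  # -> 62
print(quantize_ms(5000))  # -> 255 (clamped)
```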
From my experience, MIDI encoding does not matter at all, and it is best to keep it simple and as compact as possible for best results. The best types of encoding, in my experience, are triplets and quads for each MIDI note. This does not degrade performance while keeping things simple and efficient. You can also use asymmetrical encoding to further optimize your implementation. Asymmetrical encoding (such as was used in MuseNet) produces the same results as the symmetrical one while further increasing efficiency, at a minimal training-loss increase.

Hope this also makes sense, and I am sorry if my implementation is difficult to understand. I am mostly concerned with the end-user experience, so I sometimes drop the ball on the code/implementation itself, which may be important for devs.

Alex.
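To make the triplet idea concrete, here is a hypothetical sketch of encoding one note as a [delta_start_time, duration_velocity, patch_pitch] triplet. The offsets and factorizations (512 time values, 512 x 8 duration/velocity buckets, 129 x 129 patch/pitch combinations) are inferred from the range sizes quoted earlier in this thread, not copied from the GMT code, so treat them as illustrative:

```python
def encode_note_triplet(delta_ms, dur_ms, velocity, patch, pitch):
    """Hypothetical triplet encoding of a single MIDI note.

    Assumed token layout, based on the range sizes discussed above:
      [0, 512)             delta start time
      [512, 512 + 4096)    duration x velocity bucket
      [4608, 4608 + 16641) patch x pitch
    """
    time_tok = min(delta_ms // 8, 511)                        # 512 possible delta start times
    dur_vel_tok = 512 + min(dur_ms // 8, 511) * 8 + min(velocity // 16, 7)
    patch_pitch_tok = 512 + 4096 + patch * 129 + pitch        # patch in 0..128, pitch in 0..128
    return [time_tok, dur_vel_tok, patch_pitch_tok]


# Example: a note 250 ms after the previous one, 500 ms long, velocity 90, patch 0, pitch 60 (middle C)
print(encode_note_triplet(250, 500, 90, 0, 60))  # -> [31, 1013, 4668]
```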
Hi Alex,
@mightimatti No worries. I was also very busy with my projects, so I totally understand :) Yes, feel free to PR anything you think may be useful. And I would love to chat more and check out your work too :)

No, I never tried 1D diffusion because I do not think it will work better than autoregressive transformers. But I wanted to try to fine-tune a diffusion model on symbolic music in the form of images to see if it will work. Why do you ask, though? Do you think 1D diffusion will work on symbolic (MIDI) music?

Alex.

PS. I did some repo clean-up and updates, including tegridy-tools/TMIDIX, so check it out. I think it is much better now.
Hi,
I came across this repository, have played around with the notebooks a little bit, and succeeded in running the model locally to perform inpainting on MIDI files of mine.
I was wondering whether there are any resources on how one would go about fine-tuning the model with some more training (I would like to see if I can imitate the style of a specific composer such as J.S. Bach) using only Bach MIDI files. I don't quite understand how I would need to preprocess the MIDI files to this end. Is there a training script available that lends itself to local training?
Further, I was wondering whether you could outline how to generate MIDI using ONLY the prior embedded within the model (no additional seed MIDI at inference time). Is this possible?
Thank you for the excellent model!