Stage 6: Dataset Arrangement

Arranges the training folder into a fixed directory hierarchy for concept balancing.

If you start from this stage, set --src_dir to the training folder to arrange (/path/to/dataset_dir/training/{image_type} by default).
This stage operates in place: files are rearranged inside the source folder.

For more details, please refer to Dataset Organization.

Command line arguments

rearrange_up_levels: This argument specifies the number of directory levels to ascend from the captioned directory when setting the source directory for the rearrange stage. By default, this is set to 0, meaning no change from the captioned directory level.
Example usage: --rearrange_up_levels 2
arrange_format: This argument defines the directory hierarchy for dataset arrangement, written as components separated by slashes. The default format is n_characters/character. Other valid components are character_string (useful for further character refinement) and image_type (which should be used with --rearrange_up_levels set to a positive value).
Example usage: --arrange_format n_characters/character_string/image_type
max_character_number: This argument determines the naming convention for n_characters folders. When set, any image containing more than the specified number of characters will be grouped into a single folder named with the format {n}+_characters, where n is the number specified. The default value is 6.
Example usage: --max_character_number 2
min_images_per_combination: This argument sets the minimum number of images required for a specific character combination to have its own directory. If a combination has fewer images than this threshold, its images are placed in a character_others directory. The default value is 10.
Example usage: --min_images_per_combination 15
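
To make the interaction of these arguments concrete, here is a minimal Python sketch of the folder-naming logic described above. It is not the tool's actual implementation. The function names, the image dictionaries, the {k}_characters name for counts at or below the limit, and joining character names with "+" are all assumptions made for illustration. Only the n_characters and character components are handled; character_string and image_type are left out.

```python
from collections import Counter
from pathlib import Path


def resolve_source_dir(captioned_dir, up_levels=0):
    """Ascend `up_levels` directory levels from the captioned directory."""
    path = Path(captioned_dir)
    for _ in range(up_levels):
        path = path.parent
    return path


def n_characters_folder(n_chars, max_character_number=6):
    """Folder name for an image containing `n_chars` characters.

    Counts above the limit share one '{n}+_characters' folder, as described
    above. The '{k}_characters' name for smaller counts is an assumption.
    """
    if n_chars > max_character_number:
        return f"{max_character_number}+_characters"
    return f"{n_chars}_characters"


def character_folder(combination, counts, min_images_per_combination=10):
    """Own folder for frequent combinations, 'character_others' otherwise.

    `combination` is a non-empty tuple of character names.
    """
    if counts[combination] < min_images_per_combination:
        return "character_others"
    return "+".join(combination)  # the '+' separator is an assumption


def arranged_dir(image, counts, arrange_format="n_characters/character",
                 max_character_number=6, min_images_per_combination=10):
    """Build the relative target directory for one image."""
    combination = tuple(sorted(image["characters"]))
    components = {
        "n_characters": n_characters_folder(
            len(combination), max_character_number),
        "character": character_folder(
            combination, counts, min_images_per_combination),
    }
    return Path(*(components[part] for part in arrange_format.split("/")))


# Hypothetical dataset: 12 images of one character, 3 images of a pair.
images = (
    [{"characters": ["alice"]}] * 12
    + [{"characters": ["alice", "bob"]}] * 3
)
counts = Counter(tuple(sorted(img["characters"])) for img in images)
paths = [arranged_dir(img, counts).as_posix() for img in images]
print(paths[0])   # 1_characters/alice
print(paths[-1])  # 2_characters/character_others
```

With the default threshold of 10, the pair appears in only 3 images, so those images fall back to character_others, while the single character keeps its own folder.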