## 0.2.0
- Add the SqueezeBERT model (Iandola et al., 2020). The SqueezeBERT model replaces the matrix multiplications in the self-attention mechanism and feed-forward layers by grouped convolutions. This results in a smaller number of parameters and better computational performance (a grouped-convolution sketch follows at the end of this section).
- Add the SqueezeAlbert model. This model combines SqueezeBERT (Iandola et al., 2020) and ALBERT (Lan et al., 2020).
- `distill`: add the `attention-loss` option. Enabling this option adds the mean squared error (MSE) of the teacher and student attentions to the loss. This can speed up convergence, because the student learns to attend to the same pieces as the teacher. The attention loss can only be computed when the teacher and student have the same sequence lengths, which in practice means that they should use the same piece tokenizers. A sketch of this loss follows at the end of this section.
- Switch to the AdamW optimizer provided by `libtorch`. The tch binding now has support for the AdamW optimizer and for parameter groups. Consequently, we do not need our own AdamW optimizer implementation anymore. Switching to the Torch optimizer also speeds up training a bit. An optimizer sketch follows at the end of this section.
- Move the subword tokenizers into a separate `syntaxdot-tokenizers` crate.
- Update to `libtorch` 1.7.0.
- Remove the `server` subcommand. The new REST server is a better replacement, which supports proper error handling, etc.
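
As a rough illustration of the SqueezeBERT change, the sketch below shows the underlying idea with the tch crate. It is not SyntaxDot's actual implementation: the layer sizes and group count are made up for the example. A position-wise linear layer can be written as a 1D convolution with kernel size 1, and setting `groups` larger than 1 splits the channels into groups, dividing the weight count by the number of groups:

```rust
use tch::nn::{self, ConvConfig, Module};
use tch::{Device, Tensor};

fn main() {
    let vs = nn::VarStore::new(Device::Cpu);
    let root = vs.root();

    // A position-wise transformation as an ungrouped 1D convolution with
    // kernel size 1. This is equivalent to a linear layer applied at every
    // position: 768 * 3072 weights.
    let dense = nn::conv1d(&root / "dense", 768, 3072, 1, Default::default());

    // The same layer with 4 groups: each group maps 192 input channels to
    // 768 output channels, so the weight count drops to 768 * 3072 / 4.
    let grouped = nn::conv1d(
        &root / "grouped",
        768,
        3072,
        1,
        ConvConfig { groups: 4, ..Default::default() },
    );

    // Input: a batch of 2 sequences with 768 channels and 128 positions.
    let xs = Tensor::randn(&[2, 768, 128], tch::kind::FLOAT_CPU);
    let ys_dense = dense.forward(&xs);
    let ys_grouped = grouped.forward(&xs);

    // Both layers produce outputs of the same shape; the grouped variant
    // just uses a quarter of the parameters.
    assert_eq!(ys_dense.size(), ys_grouped.size());
    println!("output shape: {:?}", ys_grouped.size());
}
```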
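The attention loss added to `distill` can be sketched as follows. This is a minimal sketch rather than SyntaxDot's implementation, and the tensor shapes and names are assumptions: both attention tensors are taken to have the shape `[batch, heads, seq_len, seq_len]`, which is why teacher and student must use the same sequence lengths.

```rust
use tch::{Device, Kind, Reduction, Tensor};

/// Mean squared error between the teacher and student attention
/// distributions. The teacher's attentions are detached so that no
/// gradients flow into the teacher.
fn attention_loss(student_attn: &Tensor, teacher_attn: &Tensor) -> Tensor {
    student_attn.mse_loss(&teacher_attn.detach(), Reduction::Mean)
}

fn main() {
    // Hypothetical shapes: batch 2, 12 heads, sequence length 16.
    let teacher = Tensor::rand(&[2, 12, 16, 16], (Kind::Float, Device::Cpu));
    let student = Tensor::rand(&[2, 12, 16, 16], (Kind::Float, Device::Cpu));
    let loss = attention_loss(&student, &teacher);
    println!("attention loss: {}", loss.double_value(&[]));
}
```

In training, this term would simply be added to the regular distillation loss before the backward pass.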
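Finally, using `libtorch`'s AdamW through the tch binding looks roughly like the sketch below. The layer sizes, betas, weight decay, and learning rates are illustrative values rather than SyntaxDot's settings, and `set_group`/`set_lr_group` reflect my understanding of tch's parameter-group API; check the tch documentation for the version you use.

```rust
use tch::nn::{self, Module, OptimizerConfig};
use tch::{Device, Kind, Reduction, Tensor};

fn main() -> Result<(), tch::TchError> {
    let vs = nn::VarStore::new(Device::Cpu);
    let root = vs.root();

    // Parameters in the default group (0) and in group 1, so that the
    // two groups can be trained with different learning rates.
    let encoder = nn::linear(&root / "encoder", 4, 4, Default::default());
    let classifier = nn::linear(
        (&root / "classifier").set_group(1),
        4,
        1,
        Default::default(),
    );

    // AdamW as provided by libtorch through the tch binding.
    let mut opt = nn::adamw(0.9, 0.999, 0.01).build(&vs, 1e-3)?;
    // Give group 1 (the classifier) its own learning rate.
    opt.set_lr_group(1, 5e-4);

    // A single toy training step on random data.
    let xs = Tensor::rand(&[8, 4], (Kind::Float, Device::Cpu));
    let ys = Tensor::rand(&[8, 1], (Kind::Float, Device::Cpu));
    let loss = classifier
        .forward(&encoder.forward(&xs))
        .mse_loss(&ys, Reduction::Mean);
    opt.backward_step(&loss);

    Ok(())
}
```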