Grokking refers to delayed generalization following overfitting when optimizing artificial neural networks with gradient-based methods. In this work, we demonstrate that grokking can be induced by regularization, either explicit or implicit. More precisely, we show that when there exists a model with a property P (e.g., sparse or low-rank weights) that generalizes on the problem of interest, gradient descent with a small but non-zero regularization of P (e.g., ℓ1 or nuclear norm regularization) results in grokking. This extends previous work showing that small non-zero weight decay induces grokking. Moreover, our analysis shows that over-parameterization by adding depth makes it possible to grok or ungrok without explicitly using regularization, which is impossible in shallow cases. We further show that the ℓ2 norm of the model parameters cannot be used as an indicator of grokking in a general setting in place of the regularized property P: the ℓ2 norm grows in many cases where no weight decay is used, but the model generalizes anyway. We also show that grokking can be amplified through data selection alone (with all other hyperparameters held fixed).
Keywords: Grokking, Delayed Generalization, Regularization, Sparsity, Low-Rank, Overparameterization, Gradient Descent, Implicit Regularization
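
As a rough illustration of what "gradient descent with a small but non-zero regularization of P" can look like when P is sparsity, the sketch below applies one subgradient step to a squared-error loss plus an ℓ1 penalty. The linear model, the squared loss, and the names `l1_regularized_loss`, `subgradient_step`, `lr`, and `beta` are assumptions made here for illustration; they are not taken from the paper. The sketch shows only the update rule, not the grokking dynamics themselves.

```python
def sign(v):
    """Return -1, 0, or 1; 0 is a valid subgradient of |v| at v = 0."""
    return (v > 0) - (v < 0)


def residuals(w, X, y):
    """Prediction errors of the (assumed) linear model X w against targets y."""
    return [sum(wj * xj for wj, xj in zip(w, row)) - yi for row, yi in zip(X, y)]


def l1_regularized_loss(w, X, y, beta):
    """Half mean squared error plus beta times the l1 norm of w."""
    r = residuals(w, X, y)
    fit = 0.5 * sum(ri * ri for ri in r) / len(X)
    return fit + beta * sum(abs(wj) for wj in w)


def subgradient_step(w, X, y, lr, beta):
    """One subgradient descent step on l1_regularized_loss.

    beta is the (hypothetical) regularization strength; a small but
    non-zero value corresponds to the regime discussed in the text.
    """
    r = residuals(w, X, y)
    n = len(X)
    # Gradient of the data-fit term with respect to each weight.
    grad_fit = [sum(ri * row[j] for ri, row in zip(r, X)) / n
                for j in range(len(w))]
    # Add the subgradient of beta * ||w||_1 and take the step.
    return [wj - lr * (gj + beta * sign(wj)) for wj, gj in zip(w, grad_fit)]
```

Setting `beta = 0` recovers plain gradient descent on the data-fit term. Replacing the ℓ1 penalty with a nuclear norm penalty on a weight matrix would be the analogous choice when P is low rank.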