Learning without supervision

Complete unsupervision

Almost all self-supervised and unsupervised learning methods require fine-tuning for the downstream task, in addition to the carefully hand-tuned augmentations optimized during pre-training [1-2]. These methods optimize hyperparameters, such as augmentation strategies, to the point where even minor parameter changes can degrade performance. This raises a question: if the downstream task is known, can we design a learner that extracts meaningful representations without the need for augmentations?

Unsupervised periodic source detection

Detecting periodic patterns with complete unsupervision is crucial for a wide range of applications, from health monitoring to behavioral analysis. Traditional approaches to detecting periodic structures, such as the Fourier transform and autocorrelation, often fail when the data contain multiple overlapping periodic components or are noisy. Learning-based techniques, while powerful, require well-designed augmentations and, ultimately, labels for fine-tuning. Moreover, if the augmentation strategy is not well designed, the model can collapse, mapping all inputs to a single representation due to overly strong augmentations [1].
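As a point of reference, the sketch below illustrates the kind of classical pipeline being contrasted here: naive spectral peak picking with NumPy. The signal parameters are invented for illustration; the point is that when two periodic components overlap, the strongest peak need not lie in the band of interest.

```python
import numpy as np

# A classical baseline for reference (not part of the proposed method):
# estimate the dominant period of a 1-D signal by picking the largest peak
# of its power spectrum. When several periodic components overlap, the
# strongest peak need not lie in the frequency band we care about.
def dominant_period(x, fs):
    """Return the period (in seconds) of the strongest spectral peak."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                              # remove DC so it cannot win
    power = np.abs(np.fft.rfft(x)) ** 2           # power spectrum
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    peak = freqs[np.argmax(power[1:]) + 1]        # skip the zero-frequency bin
    return 1.0 / peak

# Toy signal: a weak 1.5 Hz component (the one we want) riding on a stronger
# 0.4 Hz component plus noise. Peak picking returns the 0.4 Hz period (2.5 s).
fs = 50
t = np.arange(0, 20, 1.0 / fs)
x = 0.5 * np.sin(2 * np.pi * 1.5 * t) + np.sin(2 * np.pi * 0.4 * t) + np.random.randn(t.size)
print(dominant_period(x, fs))
```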

To address these challenges, we propose three regularizers that enable the extraction of representations capturing a periodic pattern in the desired frequency band, without requiring augmentations or labeled data.

Maximizing periodicity

Since the concern is finding the periodic components in the data, a tailored regularizer is designed to maximize periodicity in the desired frequency band by minimizing the spectral entropy, defined below.

$$ \mathcal{L}_{se} = - \sum_{\omega} \bar{S}_{f_{\theta}}(\omega) \log \bar{S}_{f_{\theta}}(\omega), $$
$$ \bar{S}_{f_{\theta}}(\omega) = \frac{S_{f_{\theta}}(\omega)}{\sum_{\omega'} S_{f_{\theta}}(\omega')} \hspace{2mm} \text{and} \hspace{2mm} S_{f_{\theta}}(\omega) = \left| \sum_{n} f_{\theta}(\mathbf{x})_n e^{-j\omega n} \right|^2, $$

where \(f_{\theta}(\mathbf{x})\) and \(S_{f_{\theta}}(\omega)\) are the learned representation and its spectral density, respectively, and \(\bar{S}_{f_{\theta}}(\omega)\) is that density normalized to a probability distribution over frequencies. By minimizing this spectral entropy (\(\mathcal{L}_{se}\)), the model enforces periodic structure while suppressing noise.
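To make the term concrete, below is a minimal PyTorch sketch of \(\mathcal{L}_{se}\), assuming the representation \(f_{\theta}(\mathbf{x})\) is a real-valued series of shape (batch, time); the function name, the eps safeguard, and the toy inputs are illustrative choices rather than part of the method.

```python
import torch

# A minimal sketch of the spectral-entropy regularizer L_se, assuming the
# representation f_theta(x) is a real-valued series of shape (batch, time).
# The function name and the eps safeguard are illustrative; restricting the
# sum to bins inside the target frequency band is a natural extension.
def spectral_entropy_loss(z, eps=1e-8):
    """Entropy of the normalized power spectrum of each representation."""
    power = torch.fft.rfft(z, dim=-1).abs() ** 2                 # S_f(omega)
    power = power / (power.sum(dim=-1, keepdim=True) + eps)      # normalize to a distribution
    return -(power * torch.log(power + eps)).sum(dim=-1).mean()  # -sum S log S

# A pure sinusoid concentrates its energy in a single bin (near-zero entropy),
# while white noise spreads energy across the spectrum (high entropy).
n = torch.arange(128, dtype=torch.float32)
z_periodic = torch.sin(2 * torch.pi * 8 * n / 128).unsqueeze(0)
z_noisy = torch.randn(1, 128)
print(spectral_entropy_loss(z_periodic))  # low
print(spectral_entropy_loss(z_noisy))     # high
```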

However, simply minimizing spectral entropy can lead to a degenerate solution in which the model outputs the same representation for every input, i.e., a collapsed model.

Preventing collapse

A major issue in self-supervised learning is representation collapse, where all embeddings collapse to a single point and become indistinguishable. Previous self-supervised learning methods each develop their own strategy to prevent collapse. For example, contrastive learning relies on positive/negative sample pairs built with data augmentation, while Barlow Twins [2] employs batch-wise variance constraints to enforce diversity in the representations. However, these approaches can be problematic when batch statistics are not representative due to noise, or when the data augmentations are not carefully tuned.

Therefore, to prevent collapse in a more principled way, an additional constraint is introduced to ensure that the representations retain information from the original samples.

$$ \mathcal{L}_{ds} = \sum_{\omega} \bar{S}_X(\omega) \log \frac{\bar{S}_X(\omega)}{\bar{S}_{f_{\theta}}(\omega)}, $$
$$ \bar{S}_X(\omega) = \frac{S_X(\omega)}{\sum_{\omega'} S_X(\omega')}, \hspace{1mm} S_X(\omega) = \left| \sum_{n} x_n e^{-j \omega n} \right|^2 $$

This relative entropy constraint between the original samples \(\mathbf{x}\) and the extracted representations \(f_{\theta}(\mathbf{x})\) enforces diversity in the learned representations without imposing arbitrary variance constraints on them. By combining spectral entropy minimization with this information-preservation term, the approach overcomes the limitations of traditional self-supervised learning techniques: it eliminates the need for handcrafted augmentations and learns robust representations without requiring labeled data.
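Below is a corresponding PyTorch sketch of \(\mathcal{L}_{ds}\), again assuming real-valued series of shape (batch, time); the function names and the eps safeguard are illustrative, not prescribed by the method.

```python
import torch

# A minimal sketch of the relative-entropy term L_ds, assuming the input x and
# the representation z = f_theta(x) are real-valued series of shape (batch, time).
# Function names and the eps safeguard are illustrative, not from the text.
def normalized_power_spectrum(s, eps=1e-8):
    """Power spectrum normalized to sum to one over frequency bins."""
    power = torch.fft.rfft(s, dim=-1).abs() ** 2
    return power / (power.sum(dim=-1, keepdim=True) + eps)

def spectral_kl_loss(x, z, eps=1e-8):
    """KL divergence between the input spectrum S_X and the representation spectrum."""
    p_x = normalized_power_spectrum(x, eps)
    p_z = normalized_power_spectrum(z, eps)
    return (p_x * torch.log((p_x + eps) / (p_z + eps))).sum(dim=-1).mean()

# If the representation collapses to a constant, its spectrum piles onto the
# DC bin and the KL term becomes large, penalizing the degenerate solution.
x = torch.randn(4, 256)            # a batch of raw inputs
z_collapsed = torch.ones(4, 256)   # a collapsed representation
print(spectral_kl_loss(x, z_collapsed))  # large
print(spectral_kl_loss(x, x))            # ~0 when the spectra match
```

In a training loop, the two terms can be combined into a single objective, for example \(\mathcal{L}_{se} + \lambda\,\mathcal{L}_{ds}\), where the weighting \(\lambda\) is a hyperparameter introduced here only for illustration.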

Why not unsupervised methods for all downstream tasks?

Although the presented unsupervised technique has several advantages, there is a non-negligible performance gap relative to supervised techniques when the signal-to-noise ratio (SNR) of the training set is low. Since the presented approach learns the function $f_{\theta}$ by minimizing the empirical expectation of the proposed loss, i.e., empirical risk minimization (ERM), the model converges to a point where the risk is minimal on the training set. When the process is completely unsupervised, the model might inadvertently learn periodic noise or spurious features from the data, which leads to failures during evaluation. Therefore, the performance depends on the quality of the training data. For instance, if the goal is to learn to detect apples in images by enforcing their characteristic features (e.g., shape, color, texture), the model cannot learn effectively if the apples are completely obscured within the dataset.

References

1. Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020.

2. Zbontar, J., Jing, L., Misra, I., LeCun, Y., and Deny, S. Barlow twins: Self-supervised learning via redundancy reduction. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021.