LoRA-Ensemble: Efficient Uncertainty Modelling for Self-attention Networks (2025)


Michelle Halbheer* (ETH Zürich) · Dominik J. Mühlematter* (ETH Zürich) · Alexander Becker (ETH Zürich) · Dominik Narnhofer (ETH Zürich) · Helge Aasen (Agroscope) · Konrad Schindler (ETH Zürich) · Mehmet Ozgur Turkoglu (Agroscope)

*Equal contribution.

Abstract

Numerous crucial tasks in real-world decision-making rely on machine learning algorithms with calibrated uncertainty estimates. However, modern methods often yield overconfident and uncalibrated predictions. Various approaches involve training an ensemble of separate models to quantify the uncertainty related to the model itself, known as epistemic uncertainty. In an explicit implementation, the ensemble approach has high computational cost and high memory requirements. This challenge is particularly evident in state-of-the-art neural networks such as transformers, where even a single network is already demanding in terms of compute and memory. Consequently, efforts are made to emulate the ensemble without actually instantiating separate ensemble members, referred to as implicit ensembling. We introduce LoRA-Ensemble, a parameter-efficient deep ensemble method for self-attention networks, which is based on Low-Rank Adaptation (LoRA). LoRA was initially developed for efficient fine-tuning of large language models; we extend it into an implicit ensembling approach. By employing a single pre-trained self-attention network with weights shared across all members, we train member-specific low-rank matrices for the attention projections. Our method exhibits superior calibration compared to explicit ensembles and achieves similar or better accuracy across various prediction tasks and datasets.

1 Introduction

Machine learning models are increasingly applied in fields where incorrect estimates may have severe consequences, e.g., autonomous driving, medical diagnosis, (extreme) weather event prediction, or agricultural management decision support. In such applications well-calibrated predictive uncertainties are crucial to enable self-diagnosis. Uncertainty can be separated into two components. Aleatoric uncertainty, a.k.a. irreducible noise, is inherent in the data. Epistemic uncertainty, on the other hand, stems from a lack of knowledge about certain regions of the input space, due to a lack of training data [Der Kiureghian and Ditlevsen, 2009].

Quantification of epistemic uncertainty in large machine learning models is non-trivial. Analytical computation is usually intractable, thus research has focused on efficient approximations [Graves, 2011, Blundell etal., 2015, Welling etal., 2011]. To date, probabilistic ensembles remain the best-performing approach [Lakshminarayanan etal., 2017]. In a naïve implementation, such an ensemble consists of multiple independently trained models. Individual models are interpreted as Monte Carlo samples from the posterior weight space and are used to obtain an unbiased estimator of the posterior distribution. To achieve a low correlation between ensemble members one can capitalize on the stochastic nature of the training process and start from different initial weights, and/or sample different random batches of data. The basic principle is that the predictions of different ensemble members will agree near observed training samples, whereas they may vary far away from the training data. Their spread therefore serves as a measure of epistemic uncertainty. Even small ensembles often capture the uncertainty well (in expectation), i.e., they are well calibrated.

An issue with naïve ensembles is that their computational cost and memory footprint grow proportionally to the number of ensemble members. For smaller models explicit ensembling may still be feasible, albeit with higher financial cost and energy consumption. For modern neural networks with up to several billion parameters, hardware restrictions render the naïve approach intractable, in particular, one can no longer hold the entire ensemble in memory. Consequently, a lot of research has gone into ways of creating ensembles implicitly, without requiring multiple copies of the full base model [Wen et al., 2020, Wenzel et al., 2020, Huang et al., 2017, Turkoglu et al., 2022]. Unfortunately, most of these parameter-efficient ensembling techniques are not applicable to the newest generation of neural networks. Transformer networks [Vaswani et al., 2017] have recently become popular due to their superior ability to capture complex structures in data. However, implicit ensembling schemes tend to underperform for transformers or are entirely incompatible with them, as detailed in Appendix M.

Several studies have shown that modern neural networks are heavily overparametrized and that the solutions they converge to have low intrinsic dimension [Li et al., 2018a, Aghajanyan et al., 2020]. This led Hu et al. [2021] to propose Low-Rank Adaptation (LoRA) as a way of deploying individually fine-tuned large language models (LLMs) to different tasks while avoiding the prohibitively large memory and compute requirements of retraining them. It turns out that the weight updates of such models can be factorized to have very low rank, with hardly any loss in prediction performance.

This led us to use LoRA as the basis for a novel, parameter-efficient ensemble method that is tailored to the transformer architecture. In line with the trend towards transfer learning, our method uses a pre-trained transformer model, which is expanded into an implicit ensemble by varying the LoRA factorization while keeping the backbone weights frozen. In this way, our method only requires a small number of additional parameters to turn an existing transformer model into a diverse ensemble whose performance across various tasks is comparable to an Explicit Ensemble. In summary, our contributions are:

  • We introduce LoRA-Ensemble, a parameter-efficient probabilistic ensemble method for self-attention networks.

  • LoRA-Ensemble can be readily combined with most pre-trained transformer networks, irrespective of their specific architecture and application domain: it simply replaces the linear projection layers in the attention module with LoRA-Ensemble layers.

  • We apply LoRA-Ensemble to different classification tasks including conventional image labeling, classification of skin lesions in dermatoscopic images, sound classification from spectrograms, and out-of-distribution (OOD) detection. In these experiments, LoRA-Ensemble not only consistently outperforms other implicit ensemble schemes but also, surprisingly, its classification accuracy and uncertainty calibration are often even better than those of an Explicit Ensemble.

2 LoRA-Ensemble

The Low-Rank Adaptation (LoRA) technique makes it possible to use a pre-trained model and fine-tune it without having to retrain all its parameters. This is particularly beneficial for modern neural networks with large parameter spaces. The underlying principle is to freeze the pre-trained model weights $W_0 \in \mathbb{R}^{k \times d}$ and instead constrain the updates to a low-rank decomposition. This can be expressed mathematically as:

$$W = W_0 + \Delta W = W_0 + B \cdot A \,. \qquad (1)$$

Here $B \in \mathbb{R}^{k \times r}$ and $A \in \mathbb{R}^{r \times d}$ are two trainable low-rank matrices, where $r \ll \min(d, k)$. $W_0$ and $\Delta W$ are then multiplied with the same input $x$, which yields the following modified forward pass:

$$h = W_0 \cdot x + \Delta W \cdot x = W_0 \cdot x + B \cdot A \cdot x \,. \qquad (2)$$

LoRA applies this low-rank updating scheme only to the weights in the self-attention modules of a transformer model, while leaving the interleaved MLP modules untouched. I.e., the weight matrices being updated are $W_q$, $W_k$, and $W_v$ for the query, key, and value of the attention mechanism, as well as $W_o$ for merging the multi-head outputs. The former three are each treated as a single matrix, disregarding the fact that they are typically sliced into multiple attention heads [Hu et al., 2021].

Although not designed with uncertainty calibration in mind, the LoRA concept fulfills all the requirements of an implicit deep ensemble: by modifying the weights of the highly nonlinear self-attention mechanism, one is able to generate a diverse collection of networks with the same architecture and objective. By learning an additive, low-rank update $\Delta W = B \cdot A$ rather than directly tuning the weight matrices, the expansion into a model ensemble adds only a small number of parameters and is efficient. In detail, we start from a single, pre-trained model with frozen parameters $W_0$ and expand it with a set of trainable low-rank matrices $\Delta W_i$, $\forall i = 1 \ldots N$. At each transformer block, there is now a different forward pass per ensemble member $i$, as illustrated in Fig. 1:

$$h_i = W_0 \cdot x + \Delta W_i \cdot x = W_0 \cdot x + B_i \cdot A_i \cdot x \,, \qquad (3)$$

leading to $N$ different predictions $T_{\theta_i}(X)$ for a given input $X$. From those individual predictions, we compute the ensemble estimate by simple averaging:

$$\mathbb{E}[Y|X] = \frac{1}{N} \sum_{i=1}^{N} T_{\theta_i}(X) \,. \qquad (4)$$
[Figure 1: LoRA-Ensemble, with a shared frozen backbone and member-specific low-rank matrices in each attention projection.]
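To make Eqs. (1)-(4) concrete, below is a minimal PyTorch sketch of an ensemble-aware LoRA linear layer. It is an illustration only, not the released implementation; the class name LoRAEnsembleLinear, the member-first tensor layout, and the initialization constants are our own assumptions.

```python
import torch
import torch.nn as nn


class LoRAEnsembleLinear(nn.Module):
    """Frozen base projection W0 plus one low-rank update B_i A_i per member (Eq. 3)."""

    def __init__(self, base: nn.Linear, n_members: int, rank: int):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # W0 is shared and frozen
            p.requires_grad_(False)
        d_in, d_out = base.in_features, base.out_features
        # Member-specific low-rank factors: A_i (r x d_in), B_i (d_out x r).
        self.A = nn.Parameter(torch.randn(n_members, rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_members, d_out, rank))  # zero init => start at W0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_members, batch, tokens, d_in) -- input replicated per member
        h0 = self.base(x)                                      # W0 x, shared weights
        delta = torch.einsum('mri,mbti->mbtr', self.A, x)      # A_i x
        delta = torch.einsum('mor,mbtr->mbto', self.B, delta)  # B_i (A_i x)
        return h0 + delta


# Ensemble prediction (Eq. 4): average the member outputs.
layer = LoRAEnsembleLinear(nn.Linear(768, 768), n_members=4, rank=8)
x = torch.randn(4, 2, 197, 768)       # (members, batch, tokens, dim)
mean_out = layer(x).mean(dim=0)       # simple averaging over members
```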

2.1 Implementation

In practice, our LoRA-Ensemble is implemented by replacing the respective linear layers ($W_q$, $W_k$, $W_v$, and $W_o$) in the pre-trained model architecture with custom LoRA modules.

As a backbone for experiments with image datasets, we employ a Vision Transformer (ViT) model [Dosovitskiy et al., 2020]. The chosen architecture is the base variant with patch size $32 \times 32$, as defined in Dosovitskiy et al. [2020]. We load the weights from torchvision, which were trained on ImageNet-1k [Deng et al., 2009] using a variant of the training recipe from Touvron et al. [2020]; for details refer to their documentation.

The forward pass through the backbone is parallelized by replicating the input along the batch dimension. In each LoRA module, the data is split into separate inputs per member and passed to the respective member with the help of a vectorized map, which allows a parallelized forward pass even through the LoRA modules. The outputs are then again stacked along the batch dimension. In this way, one makes efficient use of the parallelization on the GPU, while at the same time avoiding loading the pre-trained backbone into memory multiple times.
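The following sketch illustrates the vectorized-map trick described above, assuming a member-major replication of the batch. torch.func.vmap requires PyTorch 2.x; the helper lora_update and the tensor shapes are hypothetical simplifications of the actual modules.

```python
import torch
from torch.func import vmap

n_members, rank, dim = 4, 8, 768

# Member-specific LoRA factors (trainable) for one projection layer.
A = torch.randn(n_members, rank, dim) * 0.01
B = torch.zeros(n_members, dim, rank)

def lora_update(a, b, x):
    # Low-rank update for a single member; x: (batch*tokens, dim).
    return (b @ (a @ x.T)).T

# Backbone output replicated along the batch axis, member-major ordering assumed:
batch, tokens = 2, 197
h = torch.randn(n_members * batch, tokens, dim)

# Split per member, apply each member's update in one vectorized call,
# then stack back along the batch dimension.
h_split = h.view(n_members, batch * tokens, dim)
delta = vmap(lora_update)(A, B, h_split)          # (n_members, batch*tokens, dim)
h_out = (h_split + delta).view(n_members * batch, tokens, dim)
```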

As a backbone for audio experiments, we use the Audio Spectrogram Transformer (AST) backbone [Gong et al., 2021]. That architecture was inspired by ViT (more specifically the data-efficient variant akin to DeiT [Touvron et al., 2020]) but is designed specifically for audio spectrograms. Following Gong et al. [2021], we initialize the audio model weights by transferring and appropriately interpolating them from ImageNet pre-training. See Appendices G and H for details. As the AST version of LoRA-Ensemble would run into memory limits, we introduce chunking: while the forward pass through the backbone is still parallelized, the LoRA modules are called sequentially. (For the Explicit Ensemble, vectorization could not be used on the GPU due to a technical issue with the ViT implementation in PyTorch.)

Finally, the pre-trained model does not have the correct output dimension for our prediction tasks (i.e., it was trained for a different number of classes). Therefore we entirely discard its last layer and add a new one with the correct dimensions, which we train from scratch. Obviously, the weights of that last layer are different for every ensemble member. We parallelize it in the same way as the LoRA modules described above.
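A possible way to realize the per-member classification heads in a single vectorized call is sketched below; the function ensemble_head and the initialization are illustrative assumptions, not the released code.

```python
import torch
import torch.nn as nn

n_members, dim, n_classes = 16, 768, 100

# One freshly initialized classification head per ensemble member.
head_weight = nn.Parameter(torch.randn(n_members, n_classes, dim) * 0.02)
head_bias = nn.Parameter(torch.zeros(n_members, n_classes))

def ensemble_head(cls_tokens: torch.Tensor) -> torch.Tensor:
    # cls_tokens: (n_members, batch, dim) -- per-member [CLS] features
    return torch.einsum('mcd,mbd->mbc', head_weight, cls_tokens) + head_bias[:, None, :]

logits = ensemble_head(torch.randn(n_members, 4, dim))   # (members, batch, classes)
probs = logits.softmax(dim=-1).mean(dim=0)               # ensemble average (Eq. 4)
```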

A PyTorch implementation of LoRA-Ensemble, as well as pre-trained weights to reproduce the experiments, will be publicly released on GitHub.

3 Experiments

In the following section, we evaluate the proposed LoRA-Ensemble on several datasets with regard to its predictive accuracy, uncertainty calibration, and memory usage. For each experiment we also show 1-sigma error bars, estimated from five independent runs with different random initializations.

As a first sandbox experiment, we perform image classification for the popular, widely used CIFAR-100 benchmark [Krizhevsky, 2009] (see Appendix A for the CIFAR-10 experiment). The dataset consists of 100 object classes, each with 600 samples, for a total size of 60 000 images. From that set, 10 000 images are designated test data, with all classes equally distributed between the training and testing portions.

The HAM10000 dataset was proposed for the Human Against Machine with 10 000 training images study [Tschandl etal., 2018]. It consists of 10 015 dermatoscopic images of pigmented skin lesions, collected from different populations. The dataset was initially assembled to compare machine learning methods against medical professionals on the task of classifying common pigmented skin lesions. Compared to CIFAR-100, this is arguably the more relevant test bed for our method: in the medical domain, uncertainty calibration is critical, due to the potentially far-reaching consequences of incorrect diagnoses and treatment planning.

For both datasets, LoRA-Ensemble is compared against several baselines. As a sanity check, we always include results obtained with a single Vision Transformer (ViT) model, as well as with a single ViT model with LoRA in the attention modules. These models do not have a dedicated mechanism for uncertainty calibration; instead, the predicted class-conditional likelihoods are used to quantify uncertainty. Furthermore, we compare to an explicit model ensemble, to Monte Carlo dropout (MC Dropout) as implemented in Li et al. [2023], and to a modified version of Snapshot Ensemble [Huang et al., 2017], detailed in Appendix L. Snapshot Ensemble is the only well-established implicit ensembling technique that is architecture-agnostic and can therefore be applied to self-attention networks in a straightforward fashion. For implementation challenges of other implicit methods, please refer to Appendix M. The LoRA rank was empirically set to 8 for CIFAR-100 and 4 for HAM10000.

We evaluate predictive performance and calibration quality for each method using multiple metrics. Predictive accuracy is assessed with classification accuracy (the percentage of correctly classified test samples) and the F1-score, which balances precision and recall. Calibration quality is measured using the Expected Calibration Error (ECE), Negative Log-Likelihood (NLL), and Brier score. The ECE quantifies the deviation from a perfectly calibrated model, i.e., one where the estimated uncertainty of the maximum-likelihood class correctly predicts the likelihood of a misclassification. Definitions of all metrics are provided in Appendix N.
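For reference, a minimal sketch of an equal-width-binned ECE as described above; the number of bins (here 15) is our choice for illustration and may differ from the setting used in the paper.

```python
import numpy as np

def expected_calibration_error(probs: np.ndarray, labels: np.ndarray, n_bins: int = 15) -> float:
    """Equal-width binned ECE over the confidence of the maximum-likelihood class."""
    confidences = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    accuracies = (predictions == labels).astype(float)

    ece = 0.0
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(accuracies[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap   # weight the gap by the fraction of samples in the bin
    return ece
```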

As a further benchmark from a different application domain, we process the ESC-50 environmental sounds dataset [Piczak, 2015]. It consists of 2000 sound samples, each five seconds long, that represent 50 different semantic classes with 40 samples each. To prepare the raw input waveforms for analysis, they are converted into 2-dimensional time/frequency spectrograms, see Gong et al. [2021]. These spectrograms form the input for the Audio Spectrogram Transformer (AST), a state-of-the-art transformer model for sound classification.
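For illustration, a sketch of the waveform-to-spectrogram preprocessing along the lines of Gong et al. [2021], using torchaudio's Kaldi-compatible filterbank; the exact settings (number of mel bins, frame shift, normalization) are assumptions and should be checked against the released AST code.

```python
import torch
import torchaudio

def waveform_to_spectrogram(path: str) -> torch.Tensor:
    waveform, sr = torchaudio.load(path)
    # 128-bin log-mel filterbank features with a 10 ms frame shift (assumed AST-style settings).
    fbank = torchaudio.compliance.kaldi.fbank(
        waveform,
        sample_frequency=sr,
        num_mel_bins=128,
        frame_shift=10,
        htk_compat=True,
        window_type='hanning',
    )
    # Roughly normalize before feeding the transformer.
    return (fbank - fbank.mean()) / (fbank.std() + 1e-6)
```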

As for the ViT model, we train an AST version of LoRA-Ensemble by modifying the attention weights with different sets of LoRA weights. That ensemble is then compared to a single instance of AST with and without LoRA, to an Explicit Ensemble of AST models, and to an MC Dropout variant of AST, similar to Li et al. [2023]. For ESC-50, a LoRA rank of 16 worked best, presumably due to the larger domain gap between (image-based) pre-training and the actual audio classification task. The experimental evaluation in Gong et al. [2021] employs the same performance metrics as before, but a slightly different evaluation protocol. Model training (and evaluation) is done in a 5-fold cross-validation setting, where the epoch with the best average accuracy across all five folds is chosen as the final model. The performance metrics given below are calculated by taking the predictions of all five folds at the chosen epoch and evaluating accuracy and calibration metrics jointly. While the accuracy calculated this way is equivalent to the average over the five folds, the other metrics are not, so this protocol yields a more realistic estimate of the calibration metrics.
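A small sketch of this joint evaluation protocol, under the assumption that per-fold class probabilities and labels are available as arrays: predictions of all folds are concatenated first, and the metrics are computed once over the union. With equal fold sizes the joint accuracy equals the fold average, whereas NLL, ECE, and Brier score in general do not.

```python
import numpy as np

def evaluate_jointly(fold_probs, fold_labels):
    """Concatenate per-fold predictions, then compute metrics once over the union.

    fold_probs: list of (n_k, n_classes) arrays; fold_labels: list of (n_k,) integer arrays.
    """
    probs = np.concatenate(fold_probs, axis=0)
    labels = np.concatenate(fold_labels, axis=0)
    accuracy = (probs.argmax(axis=1) == labels).mean()
    nll = -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()
    return accuracy, nll
```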

For the out-of-distribution (OOD) detection experiment, we trained models on the CIFAR-100 dataset and evaluated their performance using samples from CIFAR-100 (in-distribution) and CIFAR-10 (out-of-distribution), following standard OOD detection practice [Hendrycks and Gimpel, 2016]. We assessed performance by calculating the area under the ROC curve (AUROC) and the area under the precision-recall curve (AUPRC).
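A sketch of the maximum-softmax-probability scoring and the two metrics, using scikit-learn; treating in-distribution samples as the positive class is our assumption for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def ood_scores(in_probs: np.ndarray, out_probs: np.ndarray):
    """Max-softmax-probability OOD detection: higher confidence => more likely in-distribution."""
    confidence = np.concatenate([in_probs.max(axis=1), out_probs.max(axis=1)])
    # Label 1 for in-distribution (CIFAR-100), 0 for OOD (CIFAR-10).
    is_in_dist = np.concatenate([np.ones(len(in_probs)), np.zeros(len(out_probs))])
    auroc = roc_auc_score(is_in_dist, confidence)
    auprc = average_precision_score(is_in_dist, confidence)
    return auroc, auprc
```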

3.1 Computational Cost

In addition to evaluating classification performance and calibration, we assess the computational cost in terms of parameters, training time, and inference time. The required resources are presented in Tab. 1.

Table 1: Computational cost for 16 ensemble members.

Method            | Parameter overhead | Training time [s] | Inference time [ms]
Explicit Ensemble | 16 × 87M           | 16 × 139          | 16 × 4.6
LoRA-Ensemble     | 1.12 × 87M         | 1108              | 22.7

The total number of parameters is reported for an ensemble of 16 members, with matrices $A$ and $B$ of rank 8 when using LoRA. Choosing a different rank will slightly alter the parameter count; in many cases a lower rank may suffice, cf. Hu et al. [2021]. All times were measured on a single NVIDIA Tesla A100-80GB GPU. Training time is given as the average wall clock time per training epoch on CIFAR-100, with 16 ensemble members. Inference time is computed as the average time for a single forward pass on a CIFAR-100 example, with batch size 1. As mentioned in Sec. 2.1, the forward pass for the Explicit Ensemble processes the members sequentially; hence, we calculate the average time needed for one member and multiply it by 16. (Note that speed comparisons only make sense with the same resources: with sufficiently many GPUs, any ensemble method can be parallelized by instantiating explicit copies of different members on separate GPUs.) It is evident that the proposed method uses significantly fewer parameters and less memory. LoRA-Ensemble also trains faster, and speeds up inference more than 3 times.
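A back-of-the-envelope computation of the reported parameter overhead, under the stated assumptions (ViT-Base with 12 blocks and hidden size 768, rank 8, four adapted projections per block, 16 members); the remaining gap to the reported 1.12 × 87M is consistent with the additional per-member classification heads.

```python
# Back-of-the-envelope LoRA parameter count (illustrative assumptions: ViT-Base,
# hidden size d = 768, 12 transformer blocks, rank r = 8, four adapted projections
# W_q, W_k, W_v, W_o per block, 16 ensemble members).
d, r, blocks, projections, members = 768, 8, 12, 4, 16

params_per_member = blocks * projections * (r * d + d * r)  # A: r x d, B: d x r
total_lora_params = members * params_per_member

print(params_per_member)   # 589824  -> roughly 0.6 M per member
print(total_lora_params)   # 9437184 -> roughly 9.4 M, i.e. about 0.11 x 87 M
```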

We point out that, with our current implementation, the runtime comparisons are only indicative. It turns out that PyTorch's vectorized map (vmap) has a large one-time overhead that is only amortized for large ensembles, while small ensembles are slowed down. Practical ensemble sizes will benefit from an implementation in a framework that supports just-in-time compilation, such as JAX.

3.2 CIFAR-100

Classification accuracy and ECE are both graphed against ensemble size in Fig. 2. Quantitative results for all compared methods are summarized in Tab. 2.

LoRA-Ensemble consistently reaches higher accuracy than MC Dropout and Snapshot Ensemble, with a notable edge of approximately 5 percentage points for ensembles of four or more members. Surprisingly, it also consistently surpasses the Explicit Ensemble by about 2 percentage points, apparently a consequence of the fact that already a single ViT model, and thus every ensemble member, benefits from the addition of LoRA.

[Figure 2: classification accuracy and ECE as a function of ensemble size on CIFAR-100.]

With LoRA-Ensemble, the estimates of predictive uncertainty are also better calibrated. Interestingly, the calibration is already very good at small ensemble sizes but slightly degrades when adding more members. This effect is not present in the NLL and Brier scores, though (Tab. 2). The reliability diagram in Fig. 3 somewhat elucidates this unexpected behavior. It turns out that LoRA-Ensemble is generally under-confident, meaning that the classification is more accurate than the model suggests. Rahaman and Thiery [2020] have found that when ensembling under-confident models, the accuracy grows faster than the confidence. As a result, the difference between accuracy and confidence tends to grow, worsening calibration metrics. Note that in safety-critical applications, under-confident models that over-estimate the uncertainty are often preferable to over-confident ones.

[Figure 3: reliability diagrams on CIFAR-100.]

MC Dropout is not well calibrated for smaller ensembles, but progressively catches up as the ensemble size increases. Snapshot Ensemble performs similarly to MC Dropout in terms of accuracy but does not perform competitively for calibration.

Table 2: Results on CIFAR-100.

Method             | Accuracy (↑) | F1 (↑)     | ECE (↓)       | NLL (↓)       | Brier (↓)
Single Network     | 76.6 ± 0.3   | 76.6 ± 0.3 | 0.145 ± 0.004 | 1.181 ± 0.019 | 0.370 ± 0.004
Single Net w/ LoRA | 79.6 ± 0.2   | 79.4 ± 0.2 | 0.014 ± 0.003 | 0.671 ± 0.005 | 0.286 ± 0.003
MC Dropout         | 77.1 ± 0.5   | 77.2 ± 0.4 | 0.055 ± 0.002 | 1.138 ± 0.014 | 0.336 ± 0.005
Snapshot Ensemble  | 77.0 ± 0.1   | 77.2 ± 0.2 | 0.123 ± 0.002 | 4.416 ± 0.046 | 1.614 ± 0.007
Explicit Ensemble  | 79.8 ± 0.1   | 79.8 ± 0.2 | 0.100 ± 0.001 | 0.745 ± 0.003 | 0.284 ± 0.002
LoRA-Ensemble      | 82.5 ± 0.1   | 82.5 ± 0.1 | 0.035 ± 0.001 | 0.587 ± 0.001 | 0.253 ± 0.000

3.3 HAM10000 Lesion Classification

In many medical applications, well-calibrated models are essential. As a test case, we use the classification of pigmented skin lesions and again compare the same group of models in terms of accuracy and calibration. The results are summarized in Tab. 3.

Similar to the CIFAR-100 evaluation, LoRA-Ensemble outperforms all other methods by a clear margin, with respect to both classification accuracy and calibration. Surprisingly, Snapshot Ensemble performs very well in terms of calibration but is not competitive as far as accuracy is concerned. The experiments also further support the above discussion of confidence vs. ensemble size (Sec. 3.2). For HAM10000, LoRA-Ensemble is slightly over-confident (just like the Explicit Ensemble) and, indeed, its calibration error decreases with ensemble size in this case, see Appendix A.2.

We conducted further experiments on HAM10000 using different backbone architectures with varying numbers of parameters. The results are shown in Tab. 7 in Appendix B. In conclusion, LoRA-Ensemble generalizes effectively across different backbones. Moreover, as the number of parameters in the backbone architecture increases, the superiority of LoRA-Ensemble over the Explicit Ensemble in both accuracy and calibration becomes more pronounced.

Table 3: Results on HAM10000.

Method             | Accuracy (↑) | F1 (↑)     | ECE (↓)       | NLL (↓)       | Brier (↓)
Single Network     | 84.1 ± 0.3   | 71.4 ± 0.7 | 0.139 ± 0.004 | 1.138 ± 0.040 | 0.291 ± 0.009
Single Net w/ LoRA | 83.2 ± 0.7   | 70.7 ± 1.3 | 0.085 ± 0.004 | 0.569 ± 0.027 | 0.256 ± 0.011
MC Dropout         | 83.7 ± 0.4   | 71.0 ± 0.9 | 0.099 ± 0.007 | 0.631 ± 0.023 | 0.270 ± 0.009
Snapshot Ensemble  | 84.9 ± 0.3   | 73.7 ± 0.9 | 0.058 ± 0.004 | 0.431 ± 0.007 | 0.217 ± 0.004
Explicit Ensemble  | 85.8 ± 0.2   | 74.6 ± 0.4 | 0.105 ± 0.002 | 0.536 ± 0.007 | 0.218 ± 0.002
LoRA-Ensemble      | 88.0 ± 0.2   | 78.3 ± 0.6 | 0.037 ± 0.002 | 0.342 ± 0.003 | 0.175 ± 0.002

3.4 ESC-50 Environmental Sound Classification

To go beyond computer vision tasks, LoRA-Ensemble is also applied to an audio dataset, using the Audio Spectrogram Transformer as the backbone model.

The results are summarized in Tab. 4. On this dataset, LoRA-Ensemble does not significantly outperform the Explicit Ensemble, but still matches its performance with much lower computational demands, see Appendix J. Accuracy is insignificantly lower, whereas calibration is slightly better in terms of ECE. We note that, remarkably, the weights used in the transformer modules and for creating patch embeddings were pre-trained on images rather than audio streams.

Table 4: Results on ESC-50.

Method             | Accuracy (↑) | F1 (↑)     | ECE (↓)       | NLL (↓)       | Brier (↓)
Single Network     | 89.6 ± 0.7   | 89.5 ± 0.7 | 0.039 ± 0.004 | 0.410 ± 0.020 | 0.164 ± 0.009
Single Net w/ LoRA | 88.0 ± 0.3   | 87.8 ± 0.3 | 0.043 ± 0.004 | 0.461 ± 0.019 | 0.186 ± 0.005
MC Dropout         | 89.4 ± 0.3   | 89.3 ± 0.4 | 0.087 ± 0.005 | 0.553 ± 0.012 | 0.176 ± 0.005
Explicit Ensemble  | 91.3 ± 0.2   | 91.2 ± 0.3 | 0.027 ± 0.004 | 0.322 ± 0.004 | 0.133 ± 0.001
LoRA-Ensemble      | 91.1 ± 0.2   | 90.8 ± 0.2 | 0.021 ± 0.003 | 0.328 ± 0.004 | 0.138 ± 0.001

3.5 Out-of-Distribution (OOD) Detection

To evaluate our method's effectiveness in OOD detection, a crucial aspect of quantifying uncertainty in deep learning models [Hendrycks and Gimpel, 2016], we conducted an experiment where models were trained on CIFAR-100 (in-distribution) and tested on samples from both CIFAR-100 and CIFAR-10 (out-of-distribution). The similar distributions of these datasets make this a near-OOD task, which is more challenging than tasks involving distinctly different distributions [Sim et al., 2023]. Following Sim et al. [2023] and Chen et al. [2024], we used the maximum softmax probability as the confidence score.

Tab. 5 shows that LoRA-Ensemble significantly outperforms all other methods on both metrics, even surpassing the recently proposed Split-Ensemble method [Chen et al., 2024], which was designed specifically for OOD tasks. Furthermore, consistent with our earlier observations on LoRA's effectiveness in improving network calibration, even a single LoRA model achieves performance comparable to the Explicit Ensemble, highlighting its robustness in OOD scenarios.

Table 5: OOD detection results (trained on CIFAR-100, OOD samples from CIFAR-10).

Method                              | AUROC (↑)  | AUPRC (↑)
Split-Ensemble [Chen et al., 2024]  | 79.2       | 81.7
Single Network                      | 75.6 ± 0.3 | 77.6 ± 0.6
Single Network with LoRA            | 80.1 ± 0.5 | 82.4 ± 0.6
MC Dropout                          | 75.1 ± 0.5 | 73.7 ± 0.9
Explicit Ensemble                   | 78.9 ± 0.2 | 80.8 ± 0.2
LoRA-Ensemble                       | 82.1 ± 0.1 | 84.1 ± 0.1

3.6 Sensitivity Analysis: LoRA Rank

The main hyper-parameter introduced by adding LoRA is the rank of the low-rank decomposition (i.e., the common dimension of the matrices $A$ and $B$). Varying that rank modulates the complexity of the model for the learning task. We have empirically studied the relationship between rank, accuracy, and Expected Calibration Error. Here we show results for HAM10000; additional results for the CIFAR-100 dataset can be found in Appendix C.

On HAM10000 we observe a clear trade-off between accuracy and calibration (Fig. 4). With increasing rank, the classification accuracy increases while the calibration deteriorates; in other words, one can to some degree balance predictive accuracy against uncertainty calibration by choosing the rank. Our focus in this work is on model calibration. We therefore generally choose the rank to favor calibration, even at the cost of slightly lower classification accuracy.

[Figure 4: accuracy and ECE on HAM10000 as a function of the LoRA rank.]

4 Related Work

4.1 Estimation of Epistemic Uncertainty

A lot of work has gone into estimating the epistemic uncertainty in artificial neural networks. As the analytical computation of the posterior in such models is generally intractable, methods for approximate Bayesian inference have been proposed. Such methods rely on imposing an appropriate prior on the weights and using the likelihood of the training data to get an approximate posterior of the weight space.

The main techniques are, on the one hand, Variational Inference [Graves, 2011, Ranganath et al., 2014], which Blundell et al. [2015] have specialized to neural networks as Bayes by Backprop; and on the other hand, variants of Markov Chain Monte Carlo (MCMC) [Neal, 1996, Chen et al., 2014], including Stochastic Gradient Langevin Dynamics (SGLD) [Welling et al., 2011]. These, however, are often not able to accurately capture high-dimensional and highly non-convex loss landscapes, like the ones usually encountered in deep learning [Gustafsson et al., 2019].

4.2 Ensembles and Implicit Ensembling

Lakshminarayanan et al. [2017] have proposed a method known as deep ensembles. It uses a set of neural networks with identical architecture that are independently and randomly initialized, and (as usual) trained with variants of stochastic gradient descent (SGD). While the latter introduces further stochasticity, Fort et al. [2019] have shown that the initialization of the weights is more important to explore the admissible weight space. Ensemble members will generally converge to different modes of the loss function, such that they can be considered Monte Carlo samples of the posterior distribution [Wilson and Izmailov, 2020, Izmailov et al., 2021]. While ensembles, in general, yield the best results in terms of accuracy and uncertainty calibration, a straightforward implementation suffers from high memory and compute requirements, since multiple instances of the full neural network must be trained and stored. This can become prohibitive for modern neural networks with many millions, or even billions, of parameters.

Consequently, researchers have attempted to find ways of mimicking the principle of deep ensembles without creating several full copies of the base model. Gal and Ghahramani [2015] have proposed Monte Carlo dropout, where the posterior is approximated by sampling different dropout patterns at inference time. While this is less expensive in terms of memory, performance is often worse. Masksembles [Durasov et al., 2020] are a variant that attempts to select suitable dropout masks in order to obtain better uncertainty estimates. Snapshot Ensembles [Huang et al., 2017] use cyclic learning rates to steer the learning process such that it passes through multiple local minima, which are then stored as ensemble members. This reduces the training effort but does not address memory requirements or inference time.

Particularly relevant for our work are attempts that employ a shared backbone and modify only selected layers. Havasi et al. [2020] follow that strategy; in their case, only the first and last layers of a neural network are replicated and trained independently to emulate an ensemble. BatchEnsemble [Wen et al., 2020] is similar to LoRA-Ensemble in that it also uses low-rank matrices to change the model parameters. More specifically, shared weight matrices are modulated by element-wise multiplication with different rank-1 matrices to achieve the behavior of a deep ensemble while adding only a small number of parameters. Wenzel et al. [2020] take this concept further by also ensembling over different hyper-parameter settings. Turkoglu et al. [2022] freeze all weights of the base model and instead vary the feature-wise linear modulation [FiLM, Li et al., 2018b, Takeda et al., 2021].

A related concept was recently introduced for LLMs: the Mixtral of Experts model [Jiang et al., 2024] averages over a sparse mixture of experts to efficiently generate text.

4.3 Low-Rank Adaptation in Transformer Networks

Low-Rank Adaptation (LoRA) was originally conceived as a parameter-efficient way of fine-tuning large language models (LLMs) [Hu et al., 2021]. It is based on the observation that, while modern neural networks have huge parameter spaces, the solutions they converge to have a much lower intrinsic dimension [Li et al., 2018a, Aghajanyan et al., 2020]. LoRA exploits this, and Hu et al. [2021] show that even when fine-tuning only a low-rank update matrix $B \cdot A$ (sometimes with rank as low as one or two), the resulting models are competitive with much more expensive fine-tuning schemes. The method quickly became popular and has since also been extended with weight decomposition [Liu et al., 2024]. The LoRA idea has been applied in various fields, notably for denoising diffusion models [Luo et al., 2023, Golnari, 2023].

As we have shown, LoRA's adaptation technique naturally lends itself to parameter-efficient ensembling. We study the resulting ensemble for uncertainty calibration; a similar approach has concurrently been explored for the purpose of fine-tuning large language models [Wang et al., 2023], with promising results.

5 Discussion

On the Effectiveness of LoRA-Ensemble. Across diverse tasks, our experiments consistently show that LoRA-Ensemble matches or surpasses the predictive performance of the state-of-the-art Explicit Ensemble while offering superior calibration.

Adding LoRA to a single model without any ensembling improves calibration in most experiments beyond that of a 16-member Explicit Ensemble. This effect may be linked to the well-documented over-parameterization of modern neural networks, which often achieve higher predictive accuracy at the cost of poorer calibration [e.g., Guo et al., 2017]. By incorporating LoRA while treating all pre-trained weights as constants, we significantly reduce the trainable parameter space, potentially favoring better calibration. Increasing the number of ensemble members in the LoRA-Ensemble enhances predictive power, potentially leading to improved accuracy, while still maintaining good calibration due to the limited number of trainable weights. However, if the number of trainable weights is not kept small, for instance by increasing the LoRA rank, calibration can worsen, as demonstrated in Fig. 4. Conversely, enhancing predictive power by increasing the number of pre-trained weights (without altering the trainable weights) further improves the effectiveness of the LoRA-Ensemble in terms of both accuracy and calibration, see Appendix B.

Limitations. We propose a parameter-efficient ensembling method, which performs well in the conducted experiments. These results, however, only heuristically demonstrate the power of the method, as there is no theoretical proof that the members do, in fact, converge to different modes and therefore yield sufficient diversity to fully capture the underlying statistics. The presented work also leaves a number of questions that are yet to be answered. In our experiments, we did not evaluate LoRA-Ensemble on very large datasets, such as those often found in natural language processing. It would be interesting to see how the method performs on such datasets. Correspondingly, it would also be useful to explore how LoRA-Ensemble performs on large language models, especially as these models become ever more popular. Additionally, while our method does address the restrictive memory usage of traditional ensembles, it does not reduce computational complexity: the data still needs to be passed through the model once per ensemble member. Furthermore, it is theoretically possible to perform approximate inference on the parameter distribution of the LoRA matrices. This would enable drawing an arbitrary number of ensemble members from the approximate posterior.

Future Work. As discussed by Rahaman and Thiery [2020], our work also suggests that in a high-parameter regime, deep ensembles may not exhibit the same behavior as they do in a low-parameter regime, where they typically improve calibration. We have previously witnessed this type of phase shift in the bias-variance trade-off for large neural networks, akin to the double descent phenomenon [Nakkiran et al., 2021]. It would be valuable to conduct an in-depth analysis of deep ensemble behavior in high-parameter regimes, while also considering data size, model size, and compute.

6 Conclusion

We have presented LoRA-Ensemble, a novel, parameter-efficient method for probabilistic learning that is tailored to the transformer architecture (and potentially other architectures that make use of the attention mechanism). LoRA-Ensemble uses a simple but effective trick to turn a single base model into an implicit ensemble: the weights of the base model are kept frozen, but are modulated with the Low-Rank Adaptation (LoRA) mechanism. By training multiple, stochastically varying instances of the low-rank matrices that define the modulation, one obtains a diverse set of ensemble members that share the majority of their weights (specifically, those of the base model) and introduce only minimal overhead through the coefficients of their individual low-rank matrices. Our experiments on two different computer vision tasks, a sound classification task, and an OOD detection task show that the proposed approach can outperform other, implicit as well as explicit, ensembling strategies in terms of both classification performance and uncertainty calibration.

Broader Impact

In recent years, the size of machine learning models has expanded rapidly. GPT-3 [Brown etal., 2020] has 175 billion parameters, while its successor, GPT-4, is rumored to contain over 1.7 trillion parameters, with training costs exceeding $100 million. As the trend toward larger models continues, growing computational resources are required.With this work, \glsxtrshortlora-Ensemble aims to contribute to more efficient ensemble methods, considering the resource usage and environmental impact of AI models. This effort strives for more sustainable practices, advancing the concept of "Green AI."

References

  • Aghajanyan etal. [2020]A.Aghajanyan, S.Gupta, and L.Zettlemoyer.Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning.In 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2020.
  • Blundell etal. [2015]C.Blundell, J.Cornebise, K.Kavukcuoglu, and D.Wierstra.Weight Uncertainty in Neural Networks.In 32nd International Conference on Machine Learning, 2015.
  • Brier [1950]G.W. Brier.Verification Of Forecasts Expressed In Terms Of Probability.Monthly Weather Review, 78, 1950.
  • Brown etal. [2020]T.B. Brown, B.Mann, N.Ryder, M.Subbiah, J.Kaplan, P.Dhariwal, A.Neelakantan, P.Shyam, G.Sastry, A.Askell, S.Agarwal, A.Herbert-Voss, G.Krueger, T.Henighan, R.Child, A.Ramesh, D.M. Ziegler, J.Wu, C.Winter, C.Hesse, M.Chen, E.Sigler, M.Litwin, S.Gray, B.Chess, J.Clark, C.Berner, S.McCandlish, A.Radford, I.Sutskever, and D.Amodei.Language Models are Few-Shot Learners.In Advances in Neural Information Processing Systems, 2020.
  • Chen etal. [2024]A.Chen, H.Yang, Y.Gan, D.A. Gudovskiy, Z.Dong, H.Wang, T.Okuno, Y.Nakata, K.Keutzer, and S.Zhang.Split-Ensemble: Efficient OOD-aware Ensemble via Task and Model Splitting.In Proceedings of the 41st International Conference on Machine Learning, 2024.
  • Chen etal. [2014]T.Chen, E.B. Fox, and C.Guestrin.Stochastic Gradient Hamiltonian Monte Carlo.In 31st International Conference on Machine Learning, 2014.
  • Conrad [2023]B.Conrad.Fine-tuning Vision Transformers, 2023.URL https://github.com/bwconrad/vit-finetune.Accessed: 2024-05-20.
  • Cui etal. [2019]Y.Cui, M.Jia, T.Y. Lin, Y.Song, and S.Belongie.Class-balanced loss based on effective number of samples.In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2019.
  • Davis and Goadrich [2006]J.Davis and M.Goadrich.The Relationship between Precision-Recall and ROC Curves.In Proceedings of the 23rd International Conference on Machine Learning, 2006.
  • Deng etal. [2009]J.Deng, W.Dong, R.Socher, L.-J. Li, Kai Li, and Li Fei-Fei.ImageNet: A large-scale hierarchical image database.In IEEE Conference on Computer Vision and Pattern Recognition, 2009.
  • DerKiureghian and Ditlevsen [2009]A.DerKiureghian and O.Ditlevsen.Aleatory or epistemic? Does it matter?Structural Safety, 31(2), 2009.
  • Dosovitskiy etal. [2020]A.Dosovitskiy, L.Beyer, A.Kolesnikov, D.Weissenborn, X.Zhai, T.Unterthiner, M.Dehghani, M.Minderer, G.Heigold, S.Gelly, J.Uszkoreit, and N.Houlsby.An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.In 9th International Conference on Learning Representations, 2020.
  • Durasov etal. [2020]N.Durasov, T.Bagautdinov, P.Baque, and P.Fua.Masksembles for Uncertainty Estimation.In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2020.
  • Fort etal. [2019]S.Fort, H.Hu, and B.Lakshminarayanan.Deep Ensembles: A Loss Landscape Perspective, 2019.arXiv: 1912.02757.
  • Gal and Ghahramani [2015]Y.Gal and Z.Ghahramani.Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning.In 33rd International Conference on Machine Learning, 2015.
  • Gemmeke etal. [2017]J.F. Gemmeke, D.P. Ellis, D.Freedman, A.Jansen, W.Lawrence, R.C. Moore, M.Plakal, and M.Ritter.Audio Set: An ontology and human-labeled dataset for audio events.In IEEE International Conference on Acoustics, Speech and Signal Processing, 2017.
  • Glorot and Bengio [2010]X.Glorot and Y.Bengio.Understanding the difficulty of training deep feedforward neural networks.In 13th International Conference on Artificial Intelligence and Statistics, 2010.
  • Golnari [2023]P.A. Golnari.LoRA-Enhanced Distillation on Guided Diffusion Models, 2023.arXiv: 2312.06899.
  • Gong etal. [2021]Y.Gong, Y.A. Chung, and J.Glass.AST: Audio Spectrogram Transformer.In Annual Conference of the International Speech Communication Association, 2021.
  • Graves [2011]A.Graves.Practical Variational Inference for Neural Networks.In Advances in Neural Information Processing Systems, 2011.
  • Guo etal. [2017]C.Guo, G.Pleiss, Y.Sun, and K.Q. Weinberger.On Calibration of Modern Neural Networks.In 34th International Conference on Machine Learning, 2017.
  • Gustafsson etal. [2019]F.K. Gustafsson, M.Danelljan, and T.B. Schon.Evaluating Scalable Bayesian Deep Learning Methods for Robust Computer Vision.In IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2019.
  • Hanley and McNeil [1982]J.A. Hanley and B.J. McNeil.The meaning and use of the area under a receiver operating characteristic (ROC) curve.Radiology, 143(1), 1982.
  • Havasi etal. [2020]M.Havasi, R.Jenatton, S.Fort, J.Z. Liu, J.Snoek, B.Lakshminarayanan, A.M. Dai, and D.Tran.Training independent subnetworks for robust prediction.In 9th International Conference on Learning Representations, 2020.
  • Hendrycks and Gimpel [2016]D.Hendrycks and K.Gimpel.A baseline for detecting misclassified and out-of-distribution examples in neural networks.arXiv preprint arXiv:1610.02136, 2016.
  • Hu etal. [2021]E.Hu, Y.Shen, P.Wallis, Z.Allen-Zhu, Y.Li, S.Wang, L.Wang, and W.Chen.LoRA: Low-Rank Adaptation of Large Language Models.In 10th International Conference on Learning Representations, 2021.
  • Huang etal. [2017]G.Huang, Y.Li, G.Pleiss, Z.Liu, J.E. Hopcroft, and K.Q. Weinberger.Snapshot Ensembles: Train 1, Get M for Free.In International Conference on Learning Representations, 2017.
  • Izmailov etal. [2021]P.Izmailov, S.Vikram, M.D. Hoffman, and A.G. Wilson.What Are Bayesian Neural Network Posteriors Really Like?In Proceedings of Machine Learning Research, 2021.
  • Jiang etal. [2024]A.Q. Jiang, A.Sablayrolles, A.Roux, A.Mensch, B.Savary, C.Bamford, D.S. Chaplot, D.d.l. Casas, E.B. Hanna, F.Bressand, G.Lengyel, G.Bour, G.Lample, L.R. Lavaud, L.Saulnier, M.-A. Lachaux, P.Stock, S.Subramanian, S.Yang, S.Antoniak, T.L. Scao, T.Gervet, T.Lavril, T.Wang, T.Lacroix, and W.E. Sayed.Mixtral of Experts, 2024.arXiv: 2401.04088.
  • Krizhevsky [2009]A.Krizhevsky.Learning Multiple Layers of Features from Tiny Images.University of Toronto, 2009.
  • Lakshminarayanan etal. [2017]B.Lakshminarayanan, A.Pritzel, and C.B. Deepmind.Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles.In Advances in Neural Information Processing Systems, 2017.
  • Li etal. [2023]B.Li, Y.Hu, X.Nie, C.Han, X.Jiang, T.Guo, and L.Liu.Dropkey for vision transformer.In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
  • Li etal. [2018a]C.Li, H.Farkhoor, R.Liu, and J.Yosinski.Measuring the Intrinsic Dimension of Objective Landscapes.In 6th International Conference on Learning Representations, 2018a.
  • Li etal. [2018b]Y.Li, N.Wang, J.Shi, X.Hou, and J.Liu.Adaptive Batch Normalization for practical domain adaptation.Pattern Recognition, 80, 2018b.
  • Liu etal. [2024]S.-Y. Liu, C.-Y. Wang, H.Yin, P.Molchanov, Y.-C.F. Wang, K.-T. Cheng, and M.-H. Chen.DoRA: Weight-Decomposed Low-Rank Adaptation, 2024.arxiv: 2402.09353.
  • Loshchilov and Hutter [2017]I.Loshchilov and F.Hutter.Decoupled Weight Decay Regularization.In 7th International Conference on Learning Representations, 2017.
  • Luo etal. [2023]S.Luo, Y.Tan, S.Patil, D.Gu, P.von Platen, A.Passos, L.Huang, J.Li, and H.Zhao.LCM-LoRA: A Universal Stable-Diffusion Acceleration Module, 2023.arXiv: 2311.05556.
  • Nakkiran etal. [2021]P.Nakkiran, G.Kaplun, Y.Bansal, T.Yang, B.Barak, and I.Sutskever.Deep double descent: where bigger models and more data hurt.Journal of Statistical Mechanics: Theory and Experiment, 2021(12), 2021.
  • Neal [1996]R.M. Neal.Bayesian Learning for Neural Networks.Lecture Notes in Statistics. Springer New York, 1996.
  • Piczak [2015]K.J. Piczak.ESC: Dataset for environmental sound classification.In Proceedings of the 2015 ACM Multimedia Conference, 2015.
  • Rahaman and Thiery [2020]R.Rahaman and A.H. Thiery.Uncertainty Quantification and Deep Ensembles.In Advances in Neural Information Processing Systems, 2020.
  • Ranganath etal. [2014]R.Ranganath, S.Gerrish, and D.M. Blei.Black Box Variational Inference.In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, 2014.
  • Sim etal. [2023]M.Sim, J.Lee, and H.-J. Choi.Attention Masking for Improved Near Out-of-Distribution Image Detection.In IEEE International Conference on Big Data and Smart Computing (BigComp), 2023.
  • Takeda etal. [2021]M.Takeda, G.Benitez, and K.Yanai.Training of multiple and mixed tasks with a single network using feature modulation.In International Conference on Pattern Recognition, 2021.
  • Touvron etal. [2020]H.Touvron, M.Cord, M.Douze, F.Massa, A.Sablayrolles, and H.Jégou.Training data-efficient image transformers & distillation through attention.In Proceedings of Machine Learning Research, 2020.
  • Tschandl etal. [2018]P.Tschandl, C.Rosendahl, and H.Kittler.The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions.Scientific Data, 5, 2018.
  • Turkoglu etal. [2022]M.O. Turkoglu, A.Becker, H.A. Gündüz, M.Rezaei, B.Bischl, R.C. Daudt, S.D’Aronco, J.D. Wegner, and K.Schindler.FiLM-Ensemble: Probabilistic Deep Learning via Feature-wise Linear Modulation.In Advances in Neural Information Processing Systems, 2022.
  • Vaswani etal. [2017]A.Vaswani, N.Shazeer, N.Parmar, J.Uszkoreit, L.Jones, A.N. Gomez, L.Kaiser, and I.Polosukhin.Attention Is All You Need.In Advances in Neural Information Processing Systems, 2017.
  • Wang etal. [2023]X.Wang, L.Aitchison, and M.Rudolph.LoRA ensembles for large language model fine-tuning, 2023.arXiv: 2310.00035.
  • Welling etal. [2011]M.Welling, D.Bren, and Y.W. Teh.Bayesian Learning via Stochastic Gradient Langevin Dynamics.In 28th International Conference on International Conference on Machine Learning, 2011.
  • Wen etal. [2020]Y.Wen, D.Tran, and J.Ba.BatchEnsemble: An Alternative Approach to Efficient Ensemble and Lifelong Learning.In 8th International Conference on Learning Representations, 2020.
  • Wenzel etal. [2020]F.Wenzel, J.Snoek, D.Tran, and R.Jenatton.Hyperparameter Ensembles for Robustness and Uncertainty Quantification.In Advances in Neural Information Processing Systems, 2020.
  • Wilson and Izmailov [2020]A.G. Wilson and P.Izmailov.Bayesian Deep Learning and a Probabilistic Perspective of Generalization.In Advances in Neural Information Processing Systems, 2020.

Appendix A More Experiments & Results

This section presents comprehensive experimental results for an additional dataset, CIFAR-10, and includes additional figures for the HAM10000 dataset.

A.1 CIFAR-10

The results for the CIFAR-10 dataset, shown in Tab. 6, indicate that LoRA-Ensemble outperforms all other methods across all metrics, followed closely by a single network enhanced with LoRA. This mirrors the results found in the main paper for CIFAR-100, with the exception of the calibration of a single model. It is important to note that all methods achieve high accuracy, the differences between them are minimal, and calibration is nearly perfect for most approaches. This suggests that the CIFAR-10 dataset is relatively easy for modern transformer models, and the results should not be over-interpreted. Nevertheless, the consistent performance across different random seeds suggests that the ranking is likely significant. Given the balanced nature of the CIFAR-10 dataset, accuracy and F1-score are almost identical.

Table 6: Results on CIFAR-10.

Method             | Accuracy (↑) | F1 (↑)     | ECE (↓)       | NLL (↓)       | Brier (↓)
Single Network     | 92.8 ± 0.1   | 92.8 ± 0.1 | 0.051 ± 0.001 | 0.333 ± 0.003 | 0.120 ± 0.002
Single Net w/ LoRA | 94.5 ± 0.0   | 94.5 ± 0.0 | 0.009 ± 0.001 | 0.163 ± 0.002 | 0.082 ± 0.001
MC Dropout         | 92.9 ± 0.2   | 92.9 ± 0.2 | 0.023 ± 0.002 | 0.260 ± 0.005 | 0.110 ± 0.003
Explicit Ensemble  | 94.1 ± 0.1   | 94.1 ± 0.1 | 0.031 ± 0.001 | 0.181 ± 0.002 | 0.087 ± 0.001
Snapshot Ensemble  | 93.1 ± 0.1   | 93.1 ± 0.1 | 0.037 ± 0.002 | 1.062 ± 0.021 | 0.510 ± 0.008
LoRA-Ensemble      | 95.9 ± 0.1   | 95.9 ± 0.1 | 0.003 ± 0.001 | 0.128 ± 0.001 | 0.064 ± 0.000

A.2 HAM10000 Lesion Classification

Classification accuracy and \glsxtrshortece for the HAM10000 dataset are both plotted against ensemble size in Fig. 6. Again, \glsxtrshortlora-Ensemble outperforms all baselines for larger ensembles. Fig. 5 shows the reliability diagrams for \glsxtrshortlora-Ensemble and an Explicit Ensemble, with 16 members each, on the HAM10000 dataset. Here, the models are overconfident, further supporting our reasoning regarding the surprising behaviour of calibration with growing ensemble size on CIFAR-100.

[Fig. 5: Reliability diagrams for LoRA-Ensemble and the Explicit Ensemble with 16 members each on HAM10000. Fig. 6: Accuracy and ECE on HAM10000 as a function of ensemble size.]

Appendix B Effect of Model Size on Prediction and Calibration Performance

Building upon our existing experiments with the HAM10000 dataset, we extended our analysis to include different backbone architectures with varying numbers of parameters. Specifically, we utilized various DeiT models pre-trained with distillation, as described by Touvron et al. [2020]. The results are presented in Table 7. Notably, the DeiT Base-32 model is the same as the ViT Base-32 model.

In the small parameter regime (Tiny-16, Small-16), the addition of a single LoRA module did not consistently enhance calibration compared to using a single model. This observation contrasts with our findings in most other experiments. However, in the larger parameter regime (ViT Base-32), incorporating even a single LoRA module significantly improved calibration.

Furthermore, increasing the number of ensemble members in the LoRA-Ensemble not only boosted accuracy but also enhanced calibration, enabling it to match the performance of an Explicit Ensemble in both parameter regimes. Finally, as the number of parameters in the backbone architecture increased, the advantage of the LoRA-Ensemble over the Explicit Ensemble in terms of both accuracy and calibration became more pronounced.

Table 7: Results on HAM10000 with DeiT backbones of different sizes. An asterisk (*) marks the best value per metric within each architecture.

Arch.         | Method             | # Params. | Accuracy (\uparrow) | F1 (\uparrow) | ECE (\downarrow) | NLL (\downarrow) | Brier (\downarrow)
DeiT Tiny-16  | Single Net         | 5 M       | 89.0 ± 0.3  | 79.0 ± 0.4  | 0.096 ± 0.003  | 0.909 ± 0.037  | 0.202 ± 0.005
DeiT Tiny-16  | Single Net w/ LoRA |           | 84.5 ± 0.8  | 71.6 ± 1.5  | 0.074 ± 0.003  | 0.542 ± 0.017  | 0.237 ± 0.009
DeiT Tiny-16  | Explicit Ensemble  |           | 90.4 ± 0.3* | 81.4 ± 0.4* | 0.069 ± 0.004  | 0.340 ± 0.006  | 0.142 ± 0.002*
DeiT Tiny-16  | LoRA-Ensemble      |           | 88.9 ± 0.4  | 80.6 ± 0.2  | 0.025 ± 0.003* | 0.325 ± 0.004* | 0.164 ± 0.002
DeiT Small-16 | Single Net         | 22 M      | 89.6 ± 0.4  | 79.0 ± 0.5  | 0.093 ± 0.003  | 0.876 ± 0.032  | 0.191 ± 0.007
DeiT Small-16 | Single Net w/ LoRA |           | 86.3 ± 0.5  | 76.8 ± 1.0  | 0.100 ± 0.007  | 0.731 ± 0.053  | 0.234 ± 0.010
DeiT Small-16 | Explicit Ensemble  |           | 91.5 ± 0.1* | 82.4 ± 0.2  | 0.061 ± 0.002  | 0.318 ± 0.003  | 0.130 ± 0.001*
DeiT Small-16 | LoRA-Ensemble      |           | 90.4 ± 0.1  | 82.8 ± 0.4* | 0.047 ± 0.002* | 0.292 ± 0.002* | 0.144 ± 0.001
DeiT Base-32  | Single Net         | 86 M      | 84.1 ± 0.3  | 71.4 ± 0.7  | 0.139 ± 0.004  | 1.138 ± 0.040  | 0.291 ± 0.009
DeiT Base-32  | Single Net w/ LoRA |           | 83.2 ± 0.7  | 70.7 ± 1.3  | 0.085 ± 0.004  | 0.569 ± 0.027  | 0.256 ± 0.011
DeiT Base-32  | Explicit Ensemble  |           | 85.8 ± 0.2  | 74.6 ± 0.4  | 0.105 ± 0.002  | 0.536 ± 0.007  | 0.218 ± 0.002
DeiT Base-32  | LoRA-Ensemble      |           | 88.0 ± 0.2* | 78.3 ± 0.6* | 0.037 ± 0.002* | 0.342 ± 0.003* | 0.175 ± 0.002*

Appendix C More Sensitivity Analysis: LoRA Rank

As discussed in the paper, varying the rank of the low-rank decomposition in \glsxtrshortlora allows for modulation of the model size. We investigated the effect of rank on predictive accuracy and uncertainty calibration for \glsxtrshortlora-Ensemble. The results for the HAM10000 dataset are presented in the main paper, Sec. 3.6. For the CIFAR-100 dataset, our evaluation of \glsxtrshortlora-Ensemble shows both increased accuracy and improved calibration with increasing rank within the studied range. These findings are illustrated in Fig. 7.

[Fig. 7: Accuracy and ECE of LoRA-Ensemble on CIFAR-100 as a function of the LoRA rank.]

This observation aligns with the findings of Rahaman and Thiery [2020], as \glsxtrshortlora-Ensemble continues to exhibit under-confidence even at higher ranks. Increasing model complexity enhances confidence, thereby improving calibration. However, at rank 32, the calibration of a single network augmented with \glsxtrshortlora begins to deteriorate, suggesting that a critical boundary has been reached. Beyond this point, the parameter space becomes insufficiently constrained, leading to effects similar to those observed by Guo et al. [2017].

At higher ranks, accuracy plateaus while memory demand grows linearly with the rank: for A \in \mathbb{R}^{r \times d} and B \in \mathbb{R}^{k \times r}, the number of additional parameters scales as \mathcal{O}(r \cdot d) and \mathcal{O}(r \cdot k) respectively, where d and k are the dimensions of the pre-trained weight matrix W_0 \in \mathbb{R}^{k \times d}. Consequently, we selected rank 8 for our CIFAR-100 experiments.
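To make this scaling concrete, the following sketch counts the additional parameters per ensemble member as a function of the rank, assuming LoRA factors are attached to two attention projections per layer of a ViT-Base-32-like backbone (d = k = 768, 12 layers); the exact set of adapted projections is an illustrative assumption, not a statement of our configuration.

```python
# Rough LoRA parameter count per ensemble member as a function of the rank r.
# Assumes two adapted projections per layer, each with A in R^{r x d} and B in R^{k x r};
# d, k and the layer count follow a ViT-Base-32-like backbone (illustrative only).
def lora_params_per_member(r, d=768, k=768, n_layers=12, projections_per_layer=2):
    per_projection = r * d + k * r        # parameters of A plus parameters of B
    return n_layers * projections_per_layer * per_projection

for rank in (4, 8, 16, 32):
    print(f"rank {rank:2d}: {lora_params_per_member(rank) / 1e6:.2f} M extra parameters")
```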

Appendix D Training Details

The CIFAR-10/100 and HAM10000 experiments are based on the ViT-Base-32 architecture [Dosovitskiy et al., 2020]. This model has 12 layers, uses 768-dimensional patch embeddings, and its multi-head attention modules have 12 heads. All \glsxtrlongvit models for image classification are trained using the AdamW optimizer [Loshchilov and Hutter, 2017]. The base learning rate is initially set to 0.0001. Training uses a learning rate warm-up of 500 steps, during which the learning rate increases linearly from 0 to the base learning rate, followed by a cosine decay over the remaining steps. Gradients are clipped to a maximum norm of 1. For HAM10000, we used a weighted cross-entropy loss based on the estimated effective number of samples per class, computed with a beta parameter of 0.9991 [Cui et al., 2019]. Uniform class weights were used for all other datasets. The maximum number of training epochs varies by dataset: for CIFAR-100, the model is trained for 16 epochs (just over 25000 steps), while on HAM10000 it is trained for 65 epochs. Overall, the hyperparameters used in this work were loosely based on Conrad [2023]. The models were trained using pre-trained weights from torchvision 0.17.1 on an NVIDIA Tesla A100 graphics card. The \glsxtrshortlora models were configured with a rank of 8 for both CIFAR-10 and CIFAR-100 and a rank of 4 for HAM10000. For Monte Carlo Dropout, the dropout rate was empirically set to 0.2. Refer to Appendix K for details.
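The learning-rate schedule and the class-balanced loss described above can be sketched as follows; the use of LambdaLR, the helper names, and the placeholder class counts are our own choices for illustration, not the exact training code.

```python
import math
import torch

model = torch.nn.Linear(768, 100)                            # stand-in for the classification head
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)   # base learning rate 0.0001

warmup_steps, total_steps = 500, 25_000                      # roughly the CIFAR-100 budget

def lr_factor(step):
    # linear warm-up from 0 to the base learning rate, then cosine decay over the rest
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_factor)
# inside the training loop, gradients are clipped before each optimizer step:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

def effective_number_weights(class_counts, beta=0.9991):
    # class-balanced weights from the effective number of samples [Cui et al., 2019]
    eff_num = [(1.0 - beta ** n) / (1.0 - beta) for n in class_counts]
    weights = [1.0 / e for e in eff_num]
    total = sum(weights)
    return torch.tensor([w * len(class_counts) / total for w in weights])

# placeholder per-class sample counts (not the actual HAM10000 split statistics)
criterion = torch.nn.CrossEntropyLoss(weight=effective_number_weights([5000, 1200, 800, 500, 300, 150, 100]))
```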

The settings used for training on the ESC-50 dataset are similar to those used by Gong et al. [2021]. However, we used a batch size of 1 instead of 48 to enable training on a single GPU. The base learning rate is set to 0.00001 for the Explicit Ensemble and MC Dropout experiments and to 0.00005 for \glsxtrshortlora-Ensemble. These learning rates are lower than those used by Gong et al. [2021], owing to the smaller batch size. Refer to Appendix I for more details. The \glsxtrshortlora models were implemented with a rank of 16. The dropout rate for MC Dropout was kept at 0.2.

As Fort et al. [2019] have shown, varying the initialization of the weights is the most important factor for obtaining diverse ensemble members. For this reason, various initialization methods and corresponding parameters were tried, with a Xavier uniform initialization [Glorot and Bengio, 2010] with gain 10 giving the best combination of accuracy and calibration. For more information, refer to Appendix E. This setting is kept for models across all datasets, including the one with an \glsxtrshortast backbone.

For the same reason, we investigated whether adding noise to the pre-trained parameters of an Explicit Ensemble improves its performance through a higher diversity of members. However, the results did not show any additional benefit beyond what the randomly initialized last layer already provides, so this was not used. For more details, refer to Appendix F.

Appendix E Initialization of LoRA-Ensemble Parameters

Randomness in initialization is a key driver of diversity among ensemble members [Fort et al., 2019]. Therefore, finding the right balance between introducing diversity and overly disrupting the pre-trained parameters is crucial. Hu et al. [2021] propose using a random Gaussian initialization for A while setting B to zero, so that \Delta W = BA is zero at the start of training. In our experiments, we adopt this pattern by always initializing B to zero while varying the parameters and methods for initializing A. Following the method outlined by Hu et al. [2021], our initial experiments concentrated on the Gaussian initialization of A with mean \mu = 0 and varying standard deviations. Additionally, we tested the Xavier uniform initialization [Glorot and Bengio, 2010] with different values for the gain. All tests were conducted on the CIFAR-100 dataset, and the chosen setting was subsequently applied to all other experiments. We compare results in terms of accuracy and \glsxtrlongece.

Table 8: Accuracy and ECE on CIFAR-100 for different initializations of the LoRA matrix A.

Init. Type     | Std. / Gain | Accuracy (\uparrow) | ECE (\downarrow)
Gaussian       | 0.02 | 81.2 | 0.041
Gaussian       | 0.05 | 81.4 | 0.037
Gaussian       | 0.1  | 81.7 | 0.035
Gaussian       | 0.2  | 82.1 | 0.034
Gaussian       | 0.5  | 82.6 | 0.036
Gaussian       | 1    | 82.5 | 0.039
Gaussian       | 2    | 81.7 | 0.046
Xavier Uniform | 1    | 81.5 | 0.039
Xavier Uniform | 5    | 82.2 | 0.034
Xavier Uniform | 10   | 82.4 | 0.034
Xavier Uniform | 15   | 82.6 | 0.037
Xavier Uniform | 20   | 82.4 | 0.038
Xavier Uniform | 30   | 82.2 | 0.043

Tab. 8 presents the results quantitatively. It is immediately evident that both techniques and all tested parameters perform similarly. While more specialized models may surpass our results in terms of accuracy, our primary focus is on calibration, with the goal of maintaining comparable predictive performance. Visual inspection of the results in Fig. 8 confirms the high similarity among all configurations. Using a small calibration error at high accuracy as the decision criterion, both Gaussian initialization with a standard deviation of 0.5 and Xavier uniform initialization with a gain of 10 or 15 are viable candidates. Since a gain of 10 combines high accuracy with the lowest \glsxtrlongece, we select Xavier uniform initialization with a gain of 10 for our experiments.
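A minimal sketch of the selected scheme follows: B is zeroed so that the low-rank update vanishes at the start of training, while A is drawn with Xavier uniform initialization and gain 10 (the Gaussian variant is shown for comparison). The tensor shapes are illustrative.

```python
import torch
import torch.nn as nn

r, d, k = 8, 768, 768                         # rank and projection dimensions (illustrative)

A = nn.Parameter(torch.empty(r, d))
B = nn.Parameter(torch.zeros(k, r))           # B = 0, so Delta W = B A vanishes at the start

init_type = "xavier_uniform"                  # the setting selected above
if init_type == "xavier_uniform":
    nn.init.xavier_uniform_(A, gain=10.0)     # gain 10: high accuracy with the lowest ECE in Tab. 8
else:
    nn.init.normal_(A, mean=0.0, std=0.5)     # Gaussian alternative with std 0.5
```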

[Fig. 8: Accuracy and ECE on CIFAR-100 for the tested initialization types and parameters.]

Appendix F Initialization of Explicit Ensemble Parameters

A pre-trained \glsxtrlongvit model is the backbone for our computer vision experiments. Correspondingly, the parameters of all members in an Explicit Ensemble are initialized to the same values. Since initialization is a primary driver of diversity among ensemble members [Fort et al., 2019], it is important to study the effect of noise in the parameter initialization on the calibration of the resulting ensemble. Whenever the pre-trained weights were not trained on a dataset with the same number of classes, the last layer of every model is replaced completely. This means that, regardless of the ensemble technique used, the weights of the final classification layer vary across members, and this variation is expected to contribute significantly to member diversity. Nonetheless, we studied the impact of adding noise to the parameters of an Explicit Ensemble, using the following formula:

W_{\mathrm{new}} = W + \alpha \cdot dW,    (5)

where dW \sim \mathcal{N}(0, \sigma_W). Here \alpha is a scale factor controlling the amount of noise, and \sigma_W is the standard deviation of the parameters within a weight matrix. The perturbation is applied to each weight matrix separately.
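A minimal sketch of Eq. (5): the noise is drawn per weight matrix with that matrix's own standard deviation and scaled by \alpha, leaving biases untouched. Selecting weight matrices by their dimensionality is an assumption made for illustration.

```python
import torch

@torch.no_grad()
def perturb_weights(model: torch.nn.Module, alpha: float) -> None:
    """Apply W_new = W + alpha * dW with dW ~ N(0, sigma_W), separately per weight matrix."""
    for param in model.parameters():
        if param.dim() >= 2:                          # weight matrices only, not biases
            sigma_w = param.std()
            param.add_(alpha * sigma_w * torch.randn_like(param))

# e.g. perturb_weights(member_model, alpha=0.001) before training a member
```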

It is expected that the initial layers of a neural network learn generic features, while later layers capture dataset-specific properties. Adding noise to the later layers should therefore increase diversity while preserving the benefit of pre-training, whereas adding noise to earlier layers might disrupt pre-training more severely, especially with smaller datasets, as these parameters may not converge to meaningful values again. To address this, we set up an experiment in which noise was added only to the last encoder layers of the model, gradually increasing the number of affected encoder layers. Additionally, several noise scales \alpha were tried, ranging from 1 to 0.0001. In the presented experiment, the last classification layer is initialized using PyTorch's default method for linear layers, which at the time of writing is as follows:

W_{\mathrm{init}} = \mathrm{Unif}\left(-\sqrt{5}\cdot\sqrt{\frac{3}{fan\_in}},\ \sqrt{5}\cdot\sqrt{\frac{3}{fan\_in}}\right)    (6)
B_{\mathrm{init}} = \mathrm{Unif}\left(-\sqrt{\frac{1}{fan\_in}},\ \sqrt{\frac{1}{fan\_in}}\right)    (7)

Here W specifies the weight matrix and B the bias. Experiments are conducted on the CIFAR-100 dataset.

F.1 Results

The most important metrics for this section are accuracy and \glsxtrlongece. The results for adding noise to the last layer up to the last five layers are summarized in Fig. 9. Fig. 9(a) depicts the results for a single model, while Fig. 9(b) shows the results for an ensemble of 16 members.

It is evident that none of the experiments surpass the baseline of adding no noise beyond the random initialization of the last classification layer. Beyond the last five layers the results do not change appreciably from those shown in the plots, so the presentation is truncated at five layers. Based on these results, no additional noise is injected into the Explicit Ensemble, and only the last-layer initialization is varied.

[Fig. 9: Accuracy and ECE on CIFAR-100 when noise is added to an increasing number of final encoder layers: (a) single model, (b) ensemble of 16 members.]

Appendix G AST Implementation

A different backbone is used for the experiment on the audio dataset. Specifically, we use the \glsxtrfullast following the implementation of Gong et al. [2021], with slight modifications to fit our general architecture. Appendix H demonstrates the equivalence of our implementation. In their experiments, Gong et al. [2021] used two different types of pre-trained weights: one pre-trained on a large image dataset and the other on an audio dataset. For our research, we transfer the weights of a vision transformer known as \glsxtrshortdeit [Touvron et al., 2020], pre-trained on the ImageNet dataset [Deng et al., 2009], to the original \glsxtrshortast architecture by Gong et al. [2021]. The model has 12 layers, uses 768-dimensional patch embeddings, and its multi-head attention modules have 12 heads. This task is considered more challenging than using models pre-trained on audio datasets.

Appendix H Validation of AST Implementation

The \glsxtrfullast model provided by Gong et al. [2021] was copied without any changes. However, the training and evaluation pipeline was adapted to fit our architecture. Correspondingly, it was essential to validate the equivalence of our implementation by training a single \glsxtrshortast on the ESC-50 dataset. The results of our model should closely match those reported by Gong et al. [2021].

They offer two sets of pre-trained weights: one where the weights of a \glsxtrlongvit pre-trained on ImageNet [Deng et al., 2009] are transferred to \glsxtrshortast, and another where the \glsxtrshortast was pre-trained on AudioSet [Gemmeke et al., 2017]. To verify our implementation, we ran it using the settings provided by Gong et al. [2021] and compared the results, which are summarized in Tab. 9. The results for both pre-training modes fall within the uncertainty range reported by Gong et al. [2021]. This suggests that our pipeline yields comparable outcomes, validating our implementation for continued use.

Table 9: Validation of our AST implementation on ESC-50.

Model | Accuracy [Gong et al., 2021] | Accuracy (our implementation)
AST-S | 88.7 ± 0.7 | 88.0
AST-P | 95.6 ± 0.4 | 95.8

Appendix I Hyper-parameter Tuning for AST Experiment

The original training settings for the AST-S model in Gong et al. [2021] use a batch size of 48. However, due to the memory constraints of single-GPU training on an NVIDIA Tesla A100 with 80 GB of memory, replicating a batch size of 48 was infeasible for training an Explicit AST-S Ensemble with 8 members. Consequently, we performed minimal hyper-parameter tuning, employing a batch size of 1 for both the explicit AST-S and the \glsxtrshortlora AST-S model and exploring various learning rates. Apart from the batch size and learning rate adjustments, all other settings remain consistent with Gong et al. [2021].

The hyper-parameter tuning results for the explicit model with a batch size of 1, shown in Tab. 10, demonstrate performance similar to the original implementation with a batch size of 48, allowing for a fair comparison with our method [Gong et al., 2021]. Additionally, Tab. 11 shows the outcome of tuning the learning rate for our \glsxtrshortlora AST-S model.

Table 10: Learning-rate tuning for the explicit AST-S model (batch size 1).

Model | Learning rate | Accuracy (\uparrow) | ECE (\downarrow)
AST-S | 0.00001 | 88.2 | 0.0553
AST-S | 0.00005 | 81.7 | 0.0933

Table 11: Learning-rate tuning for the LoRA AST-S model (batch size 1).

Model      | Learning rate | Accuracy (\uparrow) | ECE (\downarrow)
LoRA AST-S | 0.00001 | 85.6 | 0.0447
LoRA AST-S | 0.00005 | 87.9 | 0.0487
LoRA AST-S | 0.0001  | 84.7 | 0.0501
LoRA AST-S | 0.0005  | 24.1 | 0.0291
LoRA AST-S | 0.001   | 11.8 | 0.0295

Appendix J Computational Cost for AST Models

As for the \glsxtrlongvit models, we estimate the required resources for the \glsxtrshortast models. The resource needs are presented in Tab. 12.

Method | Parameter overhead | Training time [s] | Inference time [ms]
Explicit Ensemble | 8 × 87 M | 517 | 8 × 7.3
\glsxtrshortlora-Ensemble | 1.08 × 87 M | 348 | 73.9

The number of parameters is reported for an ensemble of 8 members, with the A and B matrices in the models using \glsxtrshortlora having a rank of 16. Training and inference times were measured on a single NVIDIA Tesla A100-80GB \glsxtrshortgpu with a batch size of 1. Training time is given as the average wall-clock time per training epoch on ESC-50 with 8 ensemble members. Inference time is reported as the average time for a single forward pass of an ESC-50 sample with a batch size of 1.

As mentioned in Sec. 2.1, the Explicit Ensemble processes its members sequentially, while \glsxtrshortlora-Ensemble is parallelized. However, fully parallelizing the training of \glsxtrshortast models causes memory issues, so chunking was introduced: in the \glsxtrshortast \glsxtrshortlora-Ensemble models, the pass through the backbone runs in parallel, while the \glsxtrshortlora modules are called sequentially. This explains the significantly higher inference time compared to the results in Sec. 3.1. Additionally, the one-time overhead incurred by PyTorch's vmap function makes \glsxtrshortlora-Ensemble slightly slower at inference time.
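The parallelized member forward pass can be illustrated with torch.func.vmap: the shared pre-trained projection is closed over, while the member-specific LoRA factors and activations are mapped over the member dimension. This is a simplified sketch of a single projection under illustrative shapes, not our full AST implementation.

```python
import torch
from torch.func import vmap

n_members, seq_len, d, k, r = 8, 10, 768, 768, 16   # illustrative shapes, rank 16 as for AST

W0 = torch.randn(k, d)                     # shared pre-trained projection weight
A = torch.randn(n_members, r, d) * 0.02    # member-specific LoRA factor A
B = torch.zeros(n_members, k, r)           # member-specific LoRA factor B
x = torch.randn(n_members, seq_len, d)     # one activation chunk per member

def member_projection(a, b, xi):
    # (W0 + B A) x for one member: the base projection is shared, the low-rank update is not
    return xi @ W0.T + (xi @ a.T) @ b.T

y = vmap(member_projection)(A, B, x)       # shape: (n_members, seq_len, k)
```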

Appendix K Hyperparameter Tuning for MC Dropout

We conducted an analysis to determine the impact of the dropout probability on the accuracy and calibration of the \glsxtrshortvit with Monte Carlo dropout. Fig. 10 displays the accuracy and \glsxtrshortece scores for various dropout probabilities. The experiment is carried out on the HAM10000 dataset with 16 members. Our findings show that a dropout probability of 0.2 offers a good balance between accuracy and calibration.

[Fig. 10: Accuracy and ECE of the ViT with MC dropout on HAM10000 for different dropout probabilities.]

Appendix L Snapshot Ensemble Implementation details

Snapshot Ensemble [Huang et al., 2017], in its pure form, trains a single model with a cyclic learning rate and takes a snapshot every few epochs. This can make it hard, however, for the model to converge to anything meaningful within the low number of epochs available per snapshot. Therefore, we modified Snapshot Ensemble slightly by first letting training run for a number of epochs without any cycling of the learning rate. After this burn-in period the learning rate reaches 0 and a first snapshot is taken. The remaining epochs are split evenly into cycles. If the remaining number of epochs is not divisible by the desired number of ensemble members, the burn-in period is extended until it is. For the HAM10000 dataset, training is left at 65 epochs with 20 burn-in epochs. For CIFAR-10 and CIFAR-100, using only 16 epochs would leave just 1 epoch per cycle for the bigger models, so training is extended to 30 epochs with a burn-in period of 15 epochs.
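The resulting snapshot schedule can be sketched as follows. We read the procedure as counting the burn-in snapshot as the first member and splitting the remaining epochs over the remaining members, which is consistent with the 65/20 and 30/15 settings above; this interpretation, and the helper itself, are our own.

```python
def snapshot_epochs(total_epochs, burn_in, n_members):
    """Epochs at which snapshots are taken: one at the end of the burn-in period,
    then one at the end of each equally long learning-rate cycle for the remaining members."""
    cycles = n_members - 1                              # the burn-in snapshot is member 1 (assumption)
    while (total_epochs - burn_in) % cycles != 0:       # extend burn-in until the cycles divide evenly
        burn_in += 1
    cycle_len = (total_epochs - burn_in) // cycles
    return [burn_in + i * cycle_len for i in range(cycles + 1)]

print(snapshot_epochs(65, 20, 16))   # HAM10000: snapshots at epochs 20, 23, ..., 65
print(snapshot_epochs(30, 15, 16))   # CIFAR-10/100: snapshots at epochs 15, 16, ..., 30
```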

Appendix M Implicit Ensemble Baseline Challenge

Many implicit ensemble methods, such as those proposed by Wen et al. [2020], Turkoglu et al. [2022], Durasov et al. [2020], and Havasi et al. [2020], are architecture-specific and predominantly designed for MLPs or CNNs. As a result, adapting these techniques to transformer architectures presents significant challenges, since the computational structure of transformers differs considerably from that of MLPs and CNNs.

In particular, we attempted to implement FiLM-Ensemble [Turkoglu et al., 2022] on a self-attention network, given the promising results reported by its authors. However, the authors themselves noted that applying FiLM-Ensemble to transformers is not straightforward, mainly because transformers rely on LayerNorm, whereas FiLM-Ensemble was developed with BatchNorm in mind. Our experiments confirmed that directly using BatchNorm in transformers led to notable performance degradation. We explored several approaches to adapt LayerNorm; the most effective results were achieved by fixing all affine parameters for each ensemble member, allowing slight initial variations to introduce randomness and diversity while keeping the variation among members minimal. The results, summarized in Tab. 13, show that increasing the ensemble size slightly improved accuracy, though the Expected Calibration Error (ECE) fluctuated without consistent improvement. For larger ensemble sizes, such as 8 or 16, both accuracy and calibration deteriorated across all settings we tested.

# ensemble members | Accuracy (\uparrow) | ECE (\downarrow)
1 | 90.54 | 0.0286
2 | 91.18 | 0.0269
4 | 91.23 | 0.0289

Appendix N Definitions of Evaluation Metrics

We primarily evaluate our models on accuracy and Expected Calibration Error [ECE, Guo et al., 2017]. In addition to accuracy and \glsxtrlongece, we also report several other scores commonly used in probabilistic deep learning. In the following section, we present the formulations used in our implementations.

N.1 Accuracy

The accuracy is implemented instance-wise as follows:

\mathrm{Acc} = \frac{1}{N} \sum_{i=1}^{N} \frac{\lvert \hat{y}_i \cap y_i \rvert}{\lvert \hat{y}_i \cup y_i \rvert}    (8)

Here y_i denotes the true label of sample i, \hat{y}_i the predicted label of sample i, and N the total number of samples.

N.2 Expected Calibration Error

The \glsxtrlongece is a widely used metric for measuring the calibration of neural networks. We use the definition given by Guo et al. [2017]: \glsxtrshortece is the expected difference between accuracy and confidence across several confidence bins. We first define the accuracy and confidence per bin B_m as follows:

\mathrm{Acc}(B_m) = \frac{1}{\lvert B_m \rvert} \sum_{i \in B_m} \mathbf{1}(\hat{y}_i = y_i),    (9)
\mathrm{Conf}(B_m) = \frac{1}{\lvert B_m \rvert} \sum_{i \in B_m} \hat{p}_i.    (10)

Again, y_i and \hat{y}_i denote the true and predicted labels of sample i, respectively, and \hat{p}_i is the predicted confidence for sample i. With this, the \glsxtrlongece is given as:

\mathrm{ECE} = \sum_{m=1}^{M} \frac{\lvert B_m \rvert}{n} \left\lvert \mathrm{Acc}(B_m) - \mathrm{Conf}(B_m) \right\rvert    (11)
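A minimal sketch of Eqs. (9)-(11) with equally spaced confidence bins; the number of bins and the array-based interface are assumptions made for illustration.

```python
import numpy as np

def expected_calibration_error(confidences, predictions, labels, n_bins=15):
    """Eqs. (9)-(11): bin samples by confidence, then sum |Acc(B_m) - Conf(B_m)|
    weighted by the fraction of samples falling into each bin."""
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(labels)
    ece = 0.0
    for lower, upper in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lower) & (confidences <= upper)
        if not in_bin.any():
            continue
        acc = np.mean(predictions[in_bin] == labels[in_bin])    # Acc(B_m)
        conf = np.mean(confidences[in_bin])                     # Conf(B_m)
        ece += (in_bin.sum() / n) * abs(acc - conf)
    return ece
```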

N.3 Macro F1-score

F1 = \frac{1}{C} \sum_{j=1}^{C} \frac{2 p_j r_j}{p_j + r_j},    (12)

where r_j represents the recall of class j, defined as r_j = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, p_j represents the precision of class j, defined as p_j = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, and C refers to the number of classes. Here, TP, FP, and FN denote true positives, false positives, and false negatives, respectively.

N.4 Negative Log-Likelihood (NLL)

\mathrm{NLL} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C} y_{i,j} \log \hat{p}_{i,j} = -\frac{1}{N} \sum_{i=1}^{N} \log \hat{p}_i,    (13)

where N denotes the number of data points, C the number of classes, y_{i,j} is 1 if the true label of point i is j and 0 otherwise, and \hat{p}_{i,j} is the predicted probability of sample i belonging to class j.

N.5 Brier score

For the Brier score we take the definition by Brier [1950], which is as follows:

\mathrm{BS} = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C} (\hat{p}_{i,j} - y_{i,j})^2,    (14)

where N denotes the number of data points, C the number of classes, y_{i,j} is 1 if the true label of point i is j and zero otherwise, and \hat{p}_{i,j} is the predicted probability of sample i belonging to class j.
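Both scores can be computed directly from the predicted class probabilities. The following sketch assumes a probability matrix of shape (N, C) and integer labels; the small epsilon for numerical stability is our addition.

```python
import numpy as np

def nll_and_brier(probs, labels, eps=1e-12):
    """Eqs. (13) and (14): probs has shape (N, C), labels holds integer class indices."""
    n, c = probs.shape
    one_hot = np.eye(c)[labels]                                  # y_{i,j}
    nll = -np.mean(np.log(probs[np.arange(n), labels] + eps))    # mean negative log-likelihood
    brier = np.mean(np.sum((probs - one_hot) ** 2, axis=1))      # mean squared error to one-hot targets
    return nll, brier
```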

N.6 Area Under the Receiver Operating Characteristic Curve (AUROC)

The AUROC score evaluates the performance of a binary classifier by measuring its ability to distinguish between positive and negative classes, as introduced by Hanley and McNeil [1982]. In our out-of-distribution (OOD) detection experiments, the positive class corresponds to an in-distribution sample, while the negative class corresponds to an out-of-distribution sample.

The AUROC is computed as the area under the ROC curve, which plots the true positive rate (TPR) against the false positive rate (FPR) across various decision thresholds. The TPR and FPR are defined as follows:

\mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}},    (15)
\mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}},    (16)

where TP, FP, FN, and TN represent the true positives, false positives, false negatives, and true negatives, respectively.

The AUROC score is given by the following integral:

\mathrm{AUROC} = \int_{0}^{1} \mathrm{TPR}(\mathrm{FPR}) \, d\mathrm{FPR}.    (17)

A higher AUROC score indicates better classification performance, with a score of 1 representing a perfect classifier, and a score of 0.5 indicating performance equivalent to random chance.

N.7 Area Under the Precision-Recall Curve (AUPRC)

The Area Under the Precision-Recall Curve (AUPRC) assesses the performance of a binary classifier by measuring its ability to accurately identify positive instances, as described by Davis and Goadrich [2006]. In our out-of-distribution (OOD) detection experiments, the positive class corresponds to in-distribution samples, while the negative class corresponds to out-of-distribution samples.

The AUPRC is calculated as the area under the Precision-Recall (PR) curve, which plots precision against recall at various decision thresholds. Precision and recall are defined as follows:

\mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \qquad \mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}},

where TP, FP, and FN represent true positives, false positives, and false negatives, respectively.

The AUPRC score is the integral of precision with respect to recall, expressed as:

\mathrm{AUPRC} = \int_{0}^{1} \mathrm{Precision}(\mathrm{Recall}) \, d\mathrm{Recall}.

A higher AUPRC score indicates better classifier performance in recognizing positive instances, with a score near 1 representing a good classifier, characterized by both high recall and high precision. This metric is especially valuable for evaluating classifiers on imbalanced datasets.
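For reference, both OOD-detection scores can be computed with scikit-learn by treating in-distribution as the positive class and ranking samples by a scalar confidence score; the concrete score (here, placeholder values standing in for, e.g., the ensemble's maximum softmax probability) is an assumption.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# 1 = in-distribution sample (positive class), 0 = out-of-distribution sample
is_in_distribution = np.array([1, 1, 1, 0, 0, 0])
# a scalar per sample that should be higher for in-distribution inputs (placeholder values)
confidence_score = np.array([0.97, 0.91, 0.88, 0.55, 0.42, 0.60])

auroc = roc_auc_score(is_in_distribution, confidence_score)                 # area under the ROC curve
auprc = average_precision_score(is_in_distribution, confidence_score)       # area under the PR curve
```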
