Today's papers in stat.ML cluster around three interconnected methodological frontiers: the extension of classical statistical frameworks into online and decentralized settings, the derivation of finite-sample theoretical guarantees for complex inference procedures, and the integration of structural constraints into learning problems to improve both interpretability and computational efficiency. Online learning variants of survival analysis, bandit-based policy evaluation, and streaming clustering all reflect a sustained effort to adapt semiparametric and Bayesian methods to sequential data regimes where delayed feedback, censoring, and model misspecification pose genuine challenges. Simultaneously, a strong theoretical current runs through work on PAC-Bayes bounds for overparameterized models, symmetry-induced statistic recovery in variational inference, and random-matrix characterizations of early stopping, where the goal is explicit, non-asymptotic analysis that accounts for data structure rather than worst-case metric entropy. A third thread emphasizes structured representations: decentralized learning via Gibbs measures, mixture-of-experts with softmax gating, Tucker decomposition for feature selection, and orthogonal polynomial kernels for SVM interpretability all impose problem-specific geometric or algebraic structure to enable either identifiability, computational tractability, or transparent decision rules. Across these clusters, the papers prioritize precise problem formulation, controlled experimental validation, and theoretical characterization of when and why a method works, rather than benchmark dominance.
Cole Brennan
Showing of papers
Survival analysis is a widely used statistical framework for modeling time-to-event data under censoring. Classical methods, such as the Cox proportional hazards (Cox PH) model, offer a semiparametric approach to estimating the effects of covariates on the hazard function. Despite its importance, survival analysis has been largely unexplored in online settings, particularly within the bandit framework, where decisions must be made sequentially to optimize treatments as new data arrive over time. In this work, we take an initial step toward integrating survival analysis into a purely online learning setting under the Cox PH model, addressing key challenges including staggered entry, delayed feedback, and right censoring. We adapt three canonical bandit algorithms to balance exploration and exploitation, with theoretical guarantees of sublinear regret bounds. Extensive simulations and semi-real experiments using SEER cancer data demonstrate that our approach enables rapid and effective learning of near-optimal treatment policies.
We consider the problem of sampling from a probability distribution $π$. It is well known that this can be written as an optimisation problem over the space of probability distributions in which we aim to minimise the Kullback--Leibler divergence from $π$. We consider the effect of replacing $π$ with a sequence of moving targets $(π_t)_{t\ge0}$ defined via geometric tempering on the Wasserstein and Fisher--Rao gradient flows. We show that convergence occurs exponentially in continuous time, providing novel bounds in both cases. We also consider popular time discretisations and explore their convergence properties. We show that in the Fisher--Rao case, replacing the target distribution with a geometric mixture of initial and target distribution never leads to a convergence speed up both in continuous time and in discrete time. Finally, we explore the gradient flow structure of tempered dynamics and derive novel adaptive tempering schedules.
In this paper, it is shown, for the first time, that centralized performance is achievable in decentralized learning without sharing the local datasets. Specifically, when clients adopt an empirical risk minimization with relative-entropy regularization (ERM-RER) learning framework and a forward-backward communication between clients is established, it suffices to share the locally obtained Gibbs measures to achieve the same performance as that of a centralized ERM-RER with access to all the datasets. The core idea is that the Gibbs measure produced by client~$k$ is used, as reference measure, by client~$k+1$. This effectively establishes a principled way to encode prior information through a reference measure. In particular, achieving centralized performance in the decentralized setting requires a specific scaling of the regularization factors with the local sample sizes. Overall, this result opens the door to novel decentralized learning paradigms that shift the collaboration strategy from sharing data to sharing the local inductive bias via the reference measures over the set of models.
Determining identifiability of causal effects from observational data under latent confounding is a central challenge in causal inference. For linear structural causal models, identifiability of causal effects is decidable through symbolic computation. However, standard approaches based on Gröbner bases become computationally infeasible beyond small settings due to their doubly exponential complexity. In this work, we study how to practically use symbolic computation for deciding rational identifiability. In particular, we present an efficient algorithm that provably finds the lowest degree identifying formulas. For a causal effect of interest, if there exists an identification formula of a prespecified maximal degree, our algorithm returns such a formula in quasi-polynomial time.
Mixture-of-experts models provide a flexible framework for learning complex probabilistic input-output relationships by combining multiple expert models through an input-dependent gating mechanism. These models have become increasingly prominent in modern machine learning, yet their theoretical properties in the Bayesian framework remain largely unexplored. In this paper, we study Bayesian mixture-of-experts models, focusing on the ubiquitous softmax-based gating mechanism. Specifically, we investigate the asymptotic behavior of the posterior distribution for three fundamental statistical tasks: density estimation, parameter estimation, and model selection. First, we establish posterior contraction rates for density estimation, both in the regimes with a fixed, known number of experts and with a random learnable number of experts. We then analyze parameter estimation and derive convergence guarantees based on tailored Voronoi-type losses, which account for the complex identifiability structure of mixture-of-experts models. Finally, we propose and analyze two complementary strategies for selecting the number of experts. Taken together, these results provide one of the first systematic theoretical analyses of Bayesian mixture-of-experts models with softmax gating, and yield several theory-grounded insights for practical model design.
Recovering latent structure from count data has received considerable attention in network inference, particularly when one seeks both cross-group interactions and within-group similarity patterns in bipartite networks, which is widely used in ecology research. Such networks are often sparse and inherently imperfect in their detection. Existing models mainly focus on interaction recovery, while the induced similarity graphs are much less studied. Moreover, sparsity is often not controlled, and scale is unbalanced, leading to oversparse or poorly rescaled estimates with degrading structural recovery. To address these issues, we propose a framework for structured sparse nonnegative low-rank factorization with detection probability estimation. We impose nonconvex $\ell_{1/2}$ regularization on the latent similarity and connectivity structures to promote sparsity within-group similarity and cross-group connectivity with better relative scale. The resulting optimization problem is nonconvex and nonsmooth. To solve it, we develop an ADMM-based algorithm with adaptive penalization and scale-aware initialization and establish its asymptotic feasibility and KKT stationarity of cluster points under mild regularity conditions. Experiments on synthetic and real-world ecological datasets demonstrate improved recovery of latent factors and similarity/connectivity structure relative to existing baselines.
We study finite-horizon continuous-time policy evaluation from discrete closed-loop trajectories under time-inhomogeneous dynamics. The target value surface solves a backward parabolic equation, but the Bellman baseline obtained from one-step recursion is only first-order in the grid width. We estimate the time-dependent generator from multi-step transitions using moment-matching coefficients that cancel lower-order truncation terms, and combine the resulting surrogate with backward regression. The main theory gives an end-to-end decomposition into generator misspecification, projection error, pooling bias, finite-sample error, and start-up error, together with a decision-frequency regime map explaining when higher-order gains should be visible. Across calibration studies, four-scale benchmarks, feature and start-up ablations, and gain-mismatch stress tests, the second-order estimator consistently improves on the Bellman baseline and remains stable in the regime where the theory predicts visible gains. These results position high-order generator regression as an interpretable continuous-time policy-evaluation method with a clear operating region.
Estimating the number of components is a fundamental challenge in unsupervised learning, particularly when dealing with high-dimensional data with many components or severely imbalanced component sizes. This paper addresses this challenge for classical Gaussian mixture models. The proposed estimator is simple: center the data, compute the singular values of the centered matrix, and count those above a threshold. No iterative fitting, no likelihood calculation, and no prior knowledge of the number of components are required. We prove that, under a mild separation condition on the component centers, the estimator consistently recovers the true number of components. The result holds in high-dimensional settings where the dimension can be much larger than the sample size. It also holds when the number of components grows to the smaller of the dimension and the sample size, even under severe imbalance among component sizes. Computationally, the method is extremely fast: for example, it processes ten million samples in one hundred dimensions within one minute. Extensive experimental studies confirm its accuracy in challenging settings such as high dimensionality, many components, and severe class imbalance.
In uncertainty quantification, evaluating sensitivity measures under specific conditions (i.e., conditional Sobol' indices) is essential for systems with parameterized responses, such as spatial fields or varying operating conditions. Traditional approaches often rely on point-wise modeling, which is computationally expensive and may lack consistency across the parameter space. This paper demonstrates that for a pre-trained global Polynomial Chaos Expansion (PCE) model, the analytical conditional Sobol' indices are inherently embedded within its basis functions. By leveraging the tensor-product property of PCE bases, we reformulate the global expansion into a set of analytical coefficient fields that depend on the conditioning variables. Based on the preservation of orthogonality under conditional probability measures, we derive closed-form expressions for conditional variances and Sobol' indices. This framework bypasses the need for repetitive modeling or additional sampling, transforming conditional sensitivity analysis into a purely algebraic post-processing step. Numerical benchmarks indicate that the proposed method ensures physical coherence and offers superior numerical robustness and computational efficiency compared to conventional point-wise approaches.
Converting betting odds into accurate outcome probabilities is a fundamental challenge in order to use betting odds as a benchmark for sports forecasting and market efficiency analysis. In this study, we propose two methods to overcome the limitations of existing conversion methods. Firstly, we propose an odds-only method to convert betting odds to probabilities without using historical data for model fitting. While existing odds-only methods, such as Multiplicative, Shin, and Power exist, they do not adjust for biases or relationships we found in our betting odds dataset, which consists of 90014 football matches across five different bookmakers. To overcome these limitations, our proposed Odds-Only-Equal-Profitability-Confidence (OO-EPC) method aligns with the bookmakers' pricing objectives of having equal confidence in profitability for each outcome. We provide empirical evidence from our betting odds dataset that, for the majority of bookmakers, our proposed OO-EPC method outperforms the existing odds-only methods. Beyond controlled experiments, we applied the OO-EPC method under real-world uncertainty by using it for six iterations of an annual basketball outcome forecasting competition. Secondly, we propose a generalised linear model that utilises historical data for model fitting and then converts betting odds to probabilities. Existing generalised linear models attempt to capture relationships that the Efficient Market Hypothesis already captures. To overcome this shortcoming, our proposed Favourite-Longshot-Bias-Adjusted Generalised Linear Model (FL-GLM) fits just one parameter to capture the favourite-longshot bias, providing a more interpretable alternative. We provide empirical evidence from historical football matches where, for all bookmakers, our proposed FL-GLM outperforms the existing multinomial and logistic generalised linear models.
We derive explicit non-asymptotic PAC-Bayes generalization bounds for Gibbs posteriors, that is, data-dependent distributions over model parameters obtained by exponentially tilting a prior with the empirical risk. Unlike classical worst-case complexity bounds based on uniform laws of large numbers, which require explicit control of the model space in terms of metric entropy (integrals), our analysis yields posterior-averaged risk bounds that can be applied to overparameterized models and adapt to the data structure and the intrinsic model complexity. The bound involves a marginal-type integral over the parameter space, which we analyze using tools from singular learning theory to obtain explicit and practically meaningful characterizations of the posterior risk. Applications to low-rank matrix completion and ReLU neural network regression and classification show that the resulting bounds are analytically tractable and substantially tighter than classical complexity-based bounds. Our results highlight the potential of PAC-Bayes analysis for precise finite-sample generalization guarantees in modern overparameterized and singular models.
This paper proposes StrEBM, a structured latent energy-based model for source-wise structured representation learning. The framework is motivated by a broader goal of promoting identifiable and decoupled latent organization by assigning different latent dimensions their own learnable structural biases, rather than constraining the entire latent representation with a single shared energy. In this sense, blind source separation is adopted here as a concrete and verifiable testbed, through which the evolution of latent dimensions toward distinct underlying components can be directly examined. In the proposed framework, latent trajectories are optimized directly together with an observation-generation map and source-wise structural parameters. Each latent dimension is associated with its own energy-based formulation, allowing different latent components to gradually evolve toward distinct source-like roles during training. In the present study, this source-wise energy design is instantiated using Gaussian-process-inspired energies with learnable length-scales, but the framework itself is not restricted to Gaussian processes and is intended as a more general structured latent EBM formulation. Experiments on synthetic multichannel signals under linear and nonlinear mixing settings show that the proposed model can recover source components effectively, providing an initial empirical validation of the framework. At the same time, the study reveals important optimization characteristics, including slow late-stage convergence and reduced stability under nonlinear observation mappings. These findings not only clarify the practical behavior of the current GP-based instantiation, but also establish a basis for future investigation of richer source-wise energy families and more robust nonlinear optimization strategies.
This paper investigates the off-policy evaluation (OPE) problem from a distributional perspective. Rather than focusing solely on the expectation of the total return, as in most existing OPE methods, we aim to estimate the entire return distribution. To this end, we introduce a quantile-based approach for OPE using deep quantile process regression, presenting a novel algorithm called Deep Quantile Process regression-based Off-Policy Evaluation (DQPOPE). We provide new theoretical insights into the deep quantile process regression technique, extending existing approaches that estimate discrete quantiles to estimate a continuous quantile function. A key contribution of our work is the rigorous sample complexity analysis for distributional OPE with deep neural networks, bridging theoretical analysis with practical algorithmic implementations. We show that DQPOPE achieves statistical advantages by estimating the full return distribution using the same sample size required to estimate a single policy value using conventional methods. Empirical studies further show that DQPOPE provides significantly more precise and robust policy value estimates than standard methods, thereby enhancing the practical applicability and effectiveness of distributional reinforcement learning approaches.
Deep learning (DL) has become a cornerstone of modern machine learning (ML) praxis. We introduce the R package mlr3torch, which is an extensible DL framework for the mlr3 ecosystem. It is built upon the torch package, and simplifies the definition, training, and evaluation of neural networks for both tabular data and generic tensors (e.g., images) for classification and regression. The package implements predefined architectures, and torch models can easily be converted to mlr3 learners. It also allows users to define neural networks as graphs. This representation is based on the graph language defined in mlr3pipelines and allows users to define the entire modeling workflow, including preprocessing, data augmentation, and network architecture, in a single graph. Through its integration into the mlr3 ecosystem, the package allows for convenient resampling, benchmarking, preprocessing, and more. We explain the package's design and features and show how to customize and extend it to new problems. Furthermore, we demonstrate the package's capabilities using three use cases, namely hyperparameter tuning, fine-tuning, and defining architectures for multimodal data. Finally, we present some runtime benchmarks.
Variational inference (VI) is a central tool in modern machine learning, used to approximate an intractable target density by optimising over a tractable family of distributions. As the variational family cannot typically represent the target exactly, guarantees on the quality of the resulting approximation are crucial for understanding which of its properties VI can faithfully capture. Recent work has identified instances in which symmetries of the target and the variational family enable the recovery of certain statistics, even under model misspecification. However, these guarantees are inherently problem-specific and offer little insight into the fundamental mechanism by which symmetry forces statistic recovery. In this paper, we overcome this limitation by developing a general theory of symmetry-induced statistic recovery in variational inference. First, we characterise when variational minimisers inherit the symmetries of the target and establish conditions under which these pin down identifiable statistics. Second, we unify existing results by showing that previously known statistic recovery guarantees in location-scale families arise as special cases of our theory. Third, we apply our framework to distributions on the sphere to obtain novel guarantees for directional statistics in von Mises-Fisher families. Together, these results provide a modular blueprint for deriving new recovery guarantees for VI in a broad range of symmetry settings.
Selection bias arises when the probability that an observation enters a dataset depends on variables related to the quantities of interest, leading to systematic distortions in estimation and uncertainty quantification. For example, in epidemiological or survey settings, individuals with certain outcomes may be more likely to be included, resulting in biased prevalence estimates with potentially substantial downstream impact. Classical corrections, such as inverse-probability weighting or explicit likelihood-based models of the selection process, rely on tractable likelihoods, which limits their applicability in complex stochastic models with latent dynamics or high-dimensional structure. Simulation-based inference enables Bayesian analysis without tractable likelihoods but typically assumes missingness at random and thus fails when selection depends on unobserved outcomes or covariates. Here, we develop a bias-aware simulation-based inference framework that explicitly incorporates selection into neural posterior estimation. By embedding the selection mechanism directly into the generative simulator, the approach enables amortized Bayesian inference without requiring tractable likelihoods. This recasting of selection bias as part of the simulation process allows us to both obtain debiased estimates and explicitly test for the presence of bias. The framework integrates diagnostics to detect discrepancies between simulated and observed data and to assess posterior calibration. The method recovers well-calibrated posterior distributions across three statistical applications with diverse selection mechanisms, including settings in which likelihood-based approaches yield biased estimates. These results recast the correction of selection bias as a simulation problem and establish simulation-based inference as a practical and testable strategy for parameter estimation under selection bias.
Smooth functions on graphs have wide applications in manifold and semi-supervised learning. In this paper, we study a bandit problem where the payoffs of arms are smooth on a graph. This framework is suitable for solving online learning problems that involve graphs, such as content-based recommendation. In this problem, each item we can recommend is a node and its expected rating is similar to its neighbors. The goal is to recommend items that have high expected ratings. We aim for the algorithms where the cumulative regret with respect to the optimal policy would not scale poorly with the number of nodes. In particular, we introduce the notion of an effective dimension, which is small in real-world graphs, and propose two algorithms for solving our problem that scale linearly and sublinearly in this dimension. Our experiments on real-world content recommendation problem show that a good estimator of user preferences for thousands of items can be learned from just tens of nodes evaluations.
Empirical studies of trained models often report a transient regime in which signal is detectable in a finite gradient descent time window before overfitting dominates. We provide an analytically tractable random-matrix model that reproduces this phenomenon for gradient flow in a linear teacher--student setting. In this framework, learning occurs when an isolated eigenvalue separates from a noisy bulk, before eventually disappearing in the overfitting regime. The key ingredient is anisotropy in the input covariance, which induces fast and slow directions in the learning dynamics. In a two-block covariance model, we derive the full time-dependent bulk spectrum of the symmetrized weight matrix through a $2\times 2$ Dyson equation, and we obtain an explicit outlier condition for a rank-one teacher via a rank-two determinant formula. This yields a transient Baik-Ben Arous-Péché (BBP) transition: depending on signal strength and covariance anisotropy, the teacher spike may never emerge, emerge and persist, or emerge only during an intermediate time interval before being reabsorbed into the bulk. We map the corresponding phase diagrams and validate the theory against finite-size simulations. Our results provide a minimal solvable mechanism for early stopping as a transient spectral effect driven by anisotropy and noise.
Verification of model outputs is rapidly emerging as a key primitive for both training and real-world deployment of large language models (LLMs). In practice, this often involves using imperfect LLM judges and reward models since ground truth acquisition can be time-consuming and expensive. We introduce Fully Unsupervised Score Ensembling (FUSE), a method for improving verification quality by ensembling verifiers without access to ground truth correctness labels. The key idea behind FUSE is to control conditional dependencies between verifiers in a manner that improves the unsupervised performance of a class of spectral algorithms from the ensembling literature. Despite requiring zero ground truth labels, FUSE typically matches or improves upon semi-supervised alternatives in test-time scaling experiments with diverse sets of generator models, verifiers, and benchmarks. In particular, we validate our method on both conventional academic benchmarks such as GPQA Diamond and on frontier, unsaturated benchmarks such as Humanity's Last Exam and IMO Shortlist questions.
In this work, we revisit the problem of active sequential prediction-powered mean estimation, where at each round one must decide the query probability of the ground-truth label upon observing the covariates of a sample. Furthermore, if the label is not queried, the prediction from a machine learning model is used instead. Prior work proposed an elegant scheme that determines the query probability by combining an uncertainty-based suggestion with a constant probability that encodes a soft constraint on the query probability. We explored different values of the mixing parameter and observed an intriguing empirical pattern: the smallest confidence width tends to occur when the weight on the constant probability is close to one, thereby reducing the influence of the uncertainty-based component. Motivated by this observation, we develop a non-asymptotic analysis of the estimator and establish a data-dependent bound on its confidence interval. Our analysis further suggests that when a no-regret learning approach is used to determine the query probability and control this bound, the query probability converges to the constraint of the max value of the query probability when it is chosen obliviously to the current covariates. We also conduct simulations that corroborate these theoretical findings.
In multi-fidelity optimization, biased approximations of varying costs of the target function are available. This paper studies the problem of optimizing a locally smooth function with a limited budget, where the learner has to make a tradeoff between the cost and the bias of these approximations. We first prove lower bounds for the simple regret under different assumptions on the fidelities, based on a cost-to-bias function. We then present the Kometo algorithm which achieves, with additional logarithmic factors, the same rates without any knowledge of the function smoothness and fidelity assumptions, and improves previously proven guarantees. We finally empirically show that our algorithm outperforms previous multi-fidelity optimization methods without the knowledge of problem-dependent parameters.
We derive a robust update rule for the online infinite hidden Markov model (iHMM) for when the streaming data contains outliers and the model is misspecified. Leveraging recent advances in generalised Bayesian inference, we define robustness via the posterior influence function (PIF), and provide conditions under which the online iHMM has bounded PIF. Imposing robustness inevitably induces an adaptation lag for regime switching. Our method, which is called Batched Robust iHMM (BR-iHMM), balances adaptivity and robustness with two additional tunable parameters. Across limit order book data, hourly electricity demand, and a synthetic high-dimensional linear system, BR-iHMM reduces one-step-ahead forecasting error by up to 67% relative to competing online Bayesian methods. Together with theoretical guarantees of bounded PIF, our results highlight the practicality of our approach for both forecasting and interpretable online learning.
Conformal prediction (CP) has attracted broad attention as a simple and flexible framework for uncertainty quantification through prediction sets. In this work, we study how to deploy CP under differential privacy (DP) in a statistically efficient manner. We first introduce differential CP, a non-splitting conformal procedure that avoids the efficiency loss caused by data splitting and serves as a bridge between oracle CP and private conformal inference. By exploiting the stability properties of DP mechanisms, differential CP establishes a direct connection to oracle CP and inherits corresponding validity behavior. Building on this idea, we develop Differentially Private Conformal Prediction (DPCP), a fully private procedure that combines DP model training with a private quantile mechanism for calibration. We establish the end-to-end privacy guarantee of DPCP and investigate its coverage properties under additional regularity conditions. We further study the efficiency of both differential CP and DPCP under empirical risk minimization and general regression models, showing that DPCP can produce tighter prediction sets than existing private split conformal approaches under the same privacy budget. Numerical experiments on synthetic and real datasets demonstrate the practical effectiveness of the proposed methods.
We study a classification problem with three key challenges: pervasive informative missingness, the integration of partial prior expert knowledge into the learning process, and the need for interpretable decision rules. We propose a framework that encodes prior knowledge through an expert-guided class-conditional model for one or more classes, and use this model to construct a small set of interpretable goodness-of-fit features. The features quantify how well the observed data agree with the expert model, isolating the contributions of different aspects of the data, including both observed and missing components. These features are combined with a few transparent auxiliary summaries in a simple discriminative classifier, resulting in a decision rule that is easy to inspect and justify. We develop and apply the framework in the context of seismic monitoring used to assess compliance with the Comprehensive Nuclear-Test-Ban Treaty. We show that the method has strong potential as a transparent screening tool, reducing workload for expert analysts. A simulation designed to isolate the contribution of the proposed framework shows that this interpretable expert-guided method can even outperform strong standard machine-learning classifiers, particularly when training samples are small.
In online clustering problems, there is often a large amount of uncertainty over possible cluster assignments that cannot be resolved until more data are observed. This difficulty is compounded when clusters follow complex distributions, as is the case with text data. Sequential Monte Carlo (SMC) methods give a natural way of representing and updating this uncertainty over time, but have prohibitive memory requirements for large-scale problems. We propose a novel SMC algorithm that decomposes clustering problems into approximately independent subproblems, allowing a more compact representation of the algorithm state. Our approach is motivated by the knowledge base construction problem, and we show that our method is able to accurately and efficiently solve clustering problems in this setting and others where traditional SMC struggles.
We study bandit best-arm identification with arbitrary and potentially adversarial rewards. A simple random uniform learner obtains the optimal rate of error in the adversarial scenario. However, this type of strategy is suboptimal when the rewards are sampled stochastically. Therefore, we ask: Can we design a learner that performs optimally in both the stochastic and adversarial problems while not being aware of the nature of the rewards? First, we show that designing such a learner is impossible in general. In particular, to be robust to adversarial rewards, we can only guarantee optimal rates of error on a subset of the stochastic problems. We give a lower bound that characterizes the optimal rate in stochastic problems if the strategy is constrained to be robust to adversarial rewards. Finally, we design a simple parameter-free algorithm and show that its probability of error matches (up to log factors) the lower bound in stochastic problems, and it is also robust to adversarial ones.
In this paper, we proposed Bayesian Tucker decomposition (BTuD) in which residual is supposed to obey Gaussian distribution analogous to linear regression. Although we have proposed an algorithm to perform the proposed BTuD, the conventional higher-order orthogonal iteration can generate Tucker decomposition consistent with the present implementation. Using the proposed BTuD, we can perform unsupervised feature selection successfully applied to various synthetic datasets, global coupled maps with randomized coupling strength, and gene expression profiles. Thus we can conclude that our newly proposed unsupervised feature selection method is promising. In addition to this, BTuD based unsupervised FE is expected to coincide with TD based unsupervised FE that were previously proposed and successfully applied to a wide range of problems.
Feature selection is a classical problem in statistics and machine learning, and it continues to remain an extremely challenging problem especially in the context of unknown non-linear relationships with dependent features. On the other hand, Shapley values are a classic solution concept from cooperative game theory that is widely used for feature attribution in general non-linear models with highly-dependent features. However, Shapley values are not naturally suited for feature selection since they tend to capture both direct effects from each feature to the response and indirect effects through other features. In this paper, we combine the advantages of Shapley values and adapt them to feature selection by proposing \emph{MinShap}, a modification of the Shapley value framework along with a suite of other related algorithms. In particular for MinShap, instead of taking the average marginal contributions over permutations of features, considers the minimum marginal contribution across permutations. We provide a theoretical foundation motivated by the faithfulness assumption in DAG (directed acyclic graphical models), a guarantee for the Type I error of MinShap, and show through numerical simulations and real data experiments that MinShap tends to outperform state-of-the-art feature selection algorithms such as LOCO, GCM and Lasso in terms of both accuracy and stability. We also introduce a suite of algorithms related to MinShap by using the multiple testing/p-value perspective that improves performance in lower-sample settings and provide supporting theoretical guarantees.
We propose a novel amortized optimization method for predicting optimal transport (OT) plans across multiple pairs of measures by leveraging Kantorovich potentials derived from sliced OT. We introduce two amortization strategies: regression-based amortization (RA-OT) and objective-based amortization (OA-OT). In RA-OT, we formulate a functional regression model that treats Kantorovich potentials from the original OT problem as responses and those obtained from sliced OT as predictors, and estimate these models via least-squares methods. In OA-OT, we estimate the parameters of the functional model by optimizing the Kantorovich dual objective. In both approaches, the predicted OT plan is subsequently recovered from the estimated potentials. As amortized OT methods, both RA-OT and OA-OT enable efficient solutions to repeated OT problems across different measure pairs by reusing information learned from prior instances to rapidly approximate new solutions. Moreover, by exploiting the structure provided by sliced OT, the proposed models are more parsimonious, independent of specific structures of the measures, such as the number of atoms in the discrete case, while achieving high accuracy. We demonstrate the effectiveness of our approaches on tasks including MNIST digit transport, color transfer, supply-demand transportation on spherical data, and mini-batch OT conditional flow matching.
We study post-training interpretability for Support Vector Machines (SVMs) built from truncated orthogonal polynomial kernels. Since the associated reproducing kernel Hilbert space is finite-dimensional and admits an explicit tensor-product orthonormal basis, the fitted decision function can be expanded exactly in intrinsic RKHS coordinates. This leads to Orthogonal Representation Contribution Analysis (ORCA), a diagnostic framework based on normalized Orthogonal Kernel Contribution (OKC) indices. These indices quantify how the squared RKHS norm of the classifier is distributed across interaction orders, total polynomial degrees, marginal coordinate effects, and pairwise contributions. The methodology is fully post-training and requires neither surrogate models nor retraining. We illustrate its diagnostic value on a synthetic double-spiral problem and on a real five-dimensional echocardiogram dataset. The results show that the proposed indices reveal structural aspects of model complexity that are not captured by predictive accuracy alone.