Search Results
- Modelling interval data with Normal and Skew-Normal distributions. Publication. Brito, Paula; Silva, A. Pedro Duarte. A parametric modelling for interval data is proposed, assuming a multivariate Normal or Skew-Normal distribution for the midpoints and log-ranges of the interval variables. The intrinsic nature of the interval variables leads to special structures of the variance–covariance matrix, which is represented by five different possible configurations. Maximum likelihood estimation for both models under all considered configurations is studied. The proposed modelling is then considered in the context of analysis of variance and multivariate analysis of variance testing. To assess the behaviour of the proposed methodology, a simulation study is performed. The results show that, for medium or large sample sizes, tests have good power and their true significance level approaches nominal levels when the constraints assumed for the model are respected; however, for small samples, test sizes close to nominal levels cannot be guaranteed. Applications to Chinese meteorological data in three different regions and to credit card usage variables for different card designations illustrate the proposed methodology.
- A novel algorithm for simultaneous SNP selection in high-dimensional genome-wide association studies. Publication. Zuber, Verena; Silva, A. Pedro Duarte; Strimmer, Korbinian. Background: Identification of causal SNPs in most genome-wide association studies relies on approaches that consider each SNP individually. However, there is a strong correlation structure among SNPs that needs to be taken into account. Hence, modern, computationally expensive regression methods that consider all markers simultaneously, and thus incorporate dependencies among SNPs, are increasingly employed for SNP selection. Results: We develop a novel multivariate algorithm for large-scale SNP selection using CAR score regression, a promising new approach for prioritizing biomarkers. Specifically, we propose a computationally efficient procedure for shrinkage estimation of CAR scores from high-dimensional data. Subsequently, we conduct a comprehensive comparison study including five advanced regression approaches (boosting, lasso, NEG, MCP, and CAR score) and a univariate approach (marginal correlation) to determine their effectiveness in finding true causal SNPs. Conclusions: Simultaneous SNP selection is a challenging task. We demonstrate that our CAR score-based algorithm consistently outperforms all competing approaches, both uni- and multivariate, in terms of correctly recovered causal SNPs and SNP ranking. An R package implementing the approach, as well as R code to reproduce the complete study presented here, is available from http://strimmerlab.org/software/care/.
- Probabilistic clustering of interval data. Publication. Brito, Paula; Silva, A. Pedro Duarte; Dias, José G. In this paper we address the problem of clustering interval data, adopting a model-based approach. To this purpose, parametric models for interval-valued variables are used which consider configurations for the variance-covariance matrix that take the nature of the interval data directly into account. Results, on both synthetic and empirical data, clearly show that the proposed approach is well founded. The method succeeds in finding parsimonious heteroscedastic models, which is a critical feature in many applications. Furthermore, the analysis of the different data sets made clear the need to explicitly consider the intrinsic variability present in interval data.
- Outlier detection in interval data. Publication. Silva, A. Pedro Duarte; Filzmoser, Peter; Brito, Paula. A multivariate outlier detection method for interval data is proposed that makes use of a parametric approach to model the interval data. The trimmed maximum likelihood principle is adapted in order to robustly estimate the model parameters. A simulation study demonstrates the usefulness of the robust estimates for outlier detection, and new diagnostic plots give deeper insight into the structure of real-world interval data.
- Optimization approaches to supervised classification. Publication. Silva, A. Pedro Duarte. The Supervised Classification problem, one of the oldest and most recurrent problems in applied data analysis, has always been analyzed from many different perspectives. When the emphasis is placed on its overall goal of developing classification rules with minimal classification cost, Supervised Classification can be understood as an optimization problem. On the other hand, when the focus is on modeling the uncertainty involved in the classification of future unknown entities, it can be formulated as a statistical problem. Other perspectives that pay particular attention to pattern recognition and machine learning aspects of Supervised Classification also have a long history that has led to influential insights and different methodologies. In this review, two approaches to Supervised Classification strongly related to optimization theory will be discussed and compared. In particular, we will review methodologies based on Mathematical Programming models that optimize observable criteria linked to the true objective of misclassification error (or cost) minimization, and approaches derived from the minimization of known bounds on the true misclassification error. The former approach is known as the Mathematical Programming approach to Supervised Classification, while the latter is at the origin of the well-known Classification Support Vector Machines. Throughout the review, two-group as well as general multi-group problems will be considered, and the review will conclude with a discussion of the most promising research directions in this area.
- Discriminant analysis of interval data: an assessment of parametric and distance-based approaches. Publication. Silva, A. Pedro Duarte; Brito, Paula. Building on probabilistic models for interval-valued variables, parametric classification rules, based on Normal or Skew-Normal distributions, are derived for interval data. The performance of such rules is then compared with distance-based methods previously investigated. The results show that Gaussian parametric approaches outperform Skew-Normal parametric and distance-based ones in most conditions analyzed. In particular, with heteroscedastic data a quadratic Gaussian rule always performs best. Moreover, restricted cases of the variance-covariance matrix lead to parsimonious rules which, for small training samples in heteroscedastic problems, can outperform unrestricted quadratic rules, even in some cases where the model assumed by these rules is not true. These restrictions take into account the particular nature of interval data, where observations are defined by both MidPoints and Ranges, which may or may not be correlated. Under homoscedastic conditions linear Gaussian rules are often the best rules, but distance-based methods may perform better in very specific conditions.
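
Several of the abstracts listed here describe each interval through its midpoint and log-range. As a rough illustration of that representation only, the Python sketch below converts an interval to a (midpoint, log-range) pair and back. The function names and the sample values are hypothetical and are not taken from the authors' software or data.

```python
import math


def to_midpoint_logrange(lower, upper):
    """Map an interval [lower, upper] to (midpoint, log of the range).

    Hypothetical helper for illustration; the log-range is only defined
    for non-degenerate intervals (upper > lower).
    """
    if upper <= lower:
        raise ValueError("log-range requires upper > lower")
    midpoint = (lower + upper) / 2.0
    log_range = math.log(upper - lower)
    return midpoint, log_range


def to_interval(midpoint, log_range):
    """Inverse map: recover (lower, upper) from (midpoint, log-range)."""
    half_range = math.exp(log_range) / 2.0
    return midpoint - half_range, midpoint + half_range


# Made-up intervals (for example, daily minimum and maximum temperatures).
intervals = [(10.0, 18.0), (12.5, 20.5), (8.0, 9.0)]
encoded = [to_midpoint_logrange(lo, hi) for lo, hi in intervals]
```

Working with the log of the range keeps the modelled quantity unconstrained on the real line, so any real-valued pair maps back to a valid interval with a positive range.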