Browsing by Author "Silva, A. Pedro Duarte"
- Discriminant analysis of interval data: an assessment of parametric and distance-based approaches
  Silva, A. Pedro Duarte; Brito, Paula
  Building on probabilistic models for interval-valued variables, parametric classification rules, based on Normal or Skew-Normal distributions, are derived for interval data. The performance of such rules is then compared with distance-based methods previously investigated. The results show that Gaussian parametric approaches outperform Skew-Normal parametric and distance-based ones in most conditions analyzed. In particular, with heteroscedastic data a quadratic Gaussian rule always performs best. Moreover, restricted cases of the variance-covariance matrix lead to parsimonious rules which, for small training samples in heteroscedastic problems, can outperform unrestricted quadratic rules, even in some cases where the model assumed by these rules is not true. These restrictions take into account the particular nature of interval data, where observations are defined by both MidPoints and Ranges, which may or may not be correlated. Under homoscedastic conditions linear Gaussian rules are often the best rules, but distance-based methods may perform better in very specific conditions.
- Efficient screening of variable subsets in multivariate statistical models
  Silva, A. Pedro Duarte
- Exact and heuristic algorithms for variable selection: Extended Leaps and Bounds
  Silva, A. Pedro Duarte
  An implementation of enhanced versions of the classical Leaps and Bounds algorithm for variable selection is provided. Features of this implementation include: (i) the availability of general routines capable of handling many different statistical methodologies and comparison criteria; (ii) routines designed for exact and heuristic searches; (iii) the possibility of dealing with problems with more variables than observations. The implementation is supplied in two different ways: (i) as a C++ library with abstract classes that can be specialized to different problems and criteria; (ii) as a console application ready to be applied to searches according to some of the most important comparison criteria proposed to date. The code of the C++ library and console application described here can be freely obtained by sending an email to the author.
- Linear discriminant analysis with more variables than observations: a not so naive approach
  Silva, A. Pedro Duarte
  A new linear discrimination rule, designed for two-group problems with many correlated variables, is proposed. This proposal tries to incorporate the most important patterns revealed by the empirical correlations while approximating the optimal Bayes rule as the number of variables grows without limit. In order to achieve this goal the new rule relies on covariance matrix estimates derived from Gaussian factor models with small intrinsic dimensionality. Asymptotic results show that, when the model assumed for the covariance matrix estimate is a reasonable approximation to the true data generating process, the expected error rate of the new rule converges to an error close to that of the optimal Bayes rule, even in several cases where the number of variables grows faster than the number of observations. Simulation results suggest that the new rule clearly outperforms both Fisher's and Naive linear discriminant rules in the data conditions it was designed for.
- Modelling interval data with Normal and Skew-Normal distributions
  Brito, Paula; Silva, A. Pedro Duarte
  A parametric modelling for interval data is proposed, assuming a multivariate Normal or Skew-Normal distribution for the midpoints and log-ranges of the interval variables. The intrinsic nature of the interval variables leads to special structures of the variance–covariance matrix, which is represented by five different possible configurations. Maximum likelihood estimation for both models under all considered configurations is studied. The proposed modelling is then considered in the context of analysis of variance and multivariate analysis of variance testing. To assess the behaviour of the proposed methodology, a simulation study is performed. The results show that, for medium or large sample sizes, tests have good power and their true significance level approaches nominal levels when the constraints assumed for the model are respected; however, for small samples, sizes close to nominal levels cannot be guaranteed. Applications to Chinese meteorological data in three different regions and to credit card usage variables for different card designations illustrate the proposed methodology.
- New skills in symbolic data analysis for official statistics
  Verde, Rosanna; Batagelj, Vladimir; Brito, Paula; Silva, A. Pedro Duarte; Korenjak-Černe, Simona; Dobša, Jasminka; Diday, Edwin
  The paper draws attention to the use of Symbolic Data Analysis (SDA) in the field of Official Statistics. It is composed of three sections presenting three pilot techniques in the field of SDA. The three contributions are: a technique based on the notion of exactly unified summaries for the creation of symbolic objects; a model-based approach for interval data as an innovative parametric strategy in this context; and measures of similarity defined between a class and a collection of classes, based on the frequency of the categories which characterize them. The paper shows the effectiveness of the proposed approaches as prototypes of numerous techniques developed within the SDA framework and opens the way to possible further developments.
- A novel algorithm for simultaneous SNP selection in high-dimensional genome-wide association studies
  Zuber, Verena; Silva, A. Pedro Duarte; Strimmer, Korbinian
  Background: Identification of causal SNPs in most genome-wide association studies relies on approaches that consider each SNP individually. However, there is a strong correlation structure among SNPs that needs to be taken into account. Hence, modern computationally expensive regression methods that consider all markers simultaneously, and thus incorporate dependencies among SNPs, are increasingly employed for SNP selection.
  Results: We develop a novel multivariate algorithm for large scale SNP selection using CAR score regression, a promising new approach for prioritizing biomarkers. Specifically, we propose a computationally efficient procedure for shrinkage estimation of CAR scores from high-dimensional data. Subsequently, we conduct a comprehensive comparison study including five advanced regression approaches (boosting, lasso, NEG, MCP, and CAR score) and a univariate approach (marginal correlation) to determine the effectiveness in finding true causal SNPs.
  Conclusions: Simultaneous SNP selection is a challenging task. We demonstrate that our CAR score-based algorithm consistently outperforms all competing approaches, both uni- and multivariate, in terms of correctly recovered causal SNPs and SNP ranking. An R package implementing the approach as well as R code to reproduce the complete study presented here is available from http://strimmerlab.org/software/care/.
- Optimization approaches to supervised classification
  Silva, A. Pedro Duarte
  The Supervised Classification problem, one of the oldest and most recurrent problems in applied data analysis, has always been analyzed from many different perspectives. When the emphasis is placed on its overall goal of developing classification rules with minimal classification cost, Supervised Classification can be understood as an optimization problem. On the other hand, when the focus is on modeling the uncertainty involved in the classification of future unknown entities, it can be formulated as a statistical problem. Other perspectives that pay particular attention to pattern recognition and machine learning aspects of Supervised Classification also have a long history that has led to influential insights and different methodologies. In this review, two approaches to Supervised Classification strongly related to optimization theory will be discussed and compared. In particular, we will review methodologies based on Mathematical Programming models that optimize observable criteria linked to the true objective of misclassification error (or cost) minimization, and approaches derived from the minimization of known bounds on the true misclassification error. The former approach is known as the Mathematical Programming approach to Supervised Classification, while the latter is at the origin of the well-known Classification Support Vector Machines. Throughout the review two-group as well as general multi-group problems will be considered, and the review will conclude with a discussion of the most promising research directions in this area.
- Outlier detection in interval data
  Silva, A. Pedro Duarte; Filzmoser, Peter; Brito, Paula
  A multivariate outlier detection method for interval data is proposed that makes use of a parametric approach to model the interval data. The trimmed maximum likelihood principle is adapted in order to robustly estimate the model parameters. A simulation study demonstrates the usefulness of the robust estimates for outlier detection, and new diagnostic plots allow gaining deeper insight into the structure of real world interval data.
- Parametric models for distributional data
  Brito, Paula; Silva, A. Pedro Duarte
  We present parametric probabilistic models for numerical distributional variables. The proposed models are based on the representation of each distribution by a location measure and inter-quantile ranges, for given quantiles, thereby characterizing the underlying empirical distributions in a flexible way. Multivariate Normal distributions are assumed for the whole set of indicators, considering alternative structures of the variance–covariance matrix. For all cases, maximum likelihood estimators of the corresponding parameters are derived. This modelling allows for hypothesis testing and multivariate parametric analysis. The proposed framework is applied to Analysis of Variance and parametric Discriminant Analysis of distributional data. A simulation study examines the performance of the proposed models in classification problems under different data conditions. Applications to Internet traffic data and Portuguese official data illustrate the relevance of the proposed approach.
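
Several abstracts in this listing describe interval data through midpoints and log-ranges that are modelled with a multivariate Normal distribution. The Python sketch below illustrates that representation only. It is not the authors' implementation: the function names `midpoints_logranges` and `gaussian_mle` and the toy data are hypothetical, and it covers only an unrestricted covariance matrix, not the restricted configurations or the Skew-Normal model mentioned in the abstracts.

```python
import numpy as np


def midpoints_logranges(lower, upper):
    """Map interval bounds (n x p arrays) to an n x 2p array
    holding the p midpoints followed by the p log-ranges."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    if np.any(upper <= lower):
        raise ValueError("each interval needs upper > lower")
    mid = (lower + upper) / 2.0
    log_rng = np.log(upper - lower)
    return np.hstack([mid, log_rng])


def gaussian_mle(z):
    """Maximum likelihood estimates of the mean vector and of an
    unrestricted covariance matrix (divisor n, not n - 1)."""
    mu = z.mean(axis=0)
    centered = z - mu
    sigma = centered.T @ centered / z.shape[0]
    return mu, sigma


# Toy data (made up for illustration): 4 observations of 2 interval variables.
lower = np.array([[1.0, 10.0], [2.0, 12.0], [0.5, 9.0], [1.5, 11.0]])
upper = np.array([[3.0, 14.0], [5.0, 13.0], [2.5, 15.0], [2.5, 16.0]])

z = midpoints_logranges(lower, upper)
mu, sigma = gaussian_mle(z)
```

Taking logarithms of the ranges removes the positivity constraint, so a Normal model can be placed on the transformed values. Restricted covariance structures could then be imposed by constraining blocks of `sigma`, which this sketch does not attempt.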