Browsing by Author "Brito, Paula"
- Discriminant analysis of interval data: an assessment of parametric and distance-based approaches
  Publication. Silva, A. Pedro Duarte; Brito, Paula
  Building on probabilistic models for interval-valued variables, parametric classification rules based on Normal or Skew-Normal distributions are derived for interval data. The performance of these rules is then compared with that of distance-based methods investigated previously. The results show that Gaussian parametric approaches outperform Skew-Normal parametric and distance-based ones in most of the conditions analysed. In particular, with heteroscedastic data a quadratic Gaussian rule always performs best. Moreover, restricted cases of the variance-covariance matrix lead to parsimonious rules which, for small training samples in heteroscedastic problems, can outperform unrestricted quadratic rules, even in some cases where the model assumed by these rules is not true. These restrictions take into account the particular nature of interval data, where observations are defined by both MidPoints and Ranges, which may or may not be correlated. Under homoscedastic conditions linear Gaussian rules are often the best, but distance-based methods may perform better in very specific conditions.
- Discriminant Analysis of Interval Data: Parametric Versus Distance-Based Approaches
  Publication. Duarte Silva, A. P.; Brito, Paula
- Identifying Special Structures in Interval-Data via Model-Base Clustering
  Publication. Brito, Paula; Duarte Silva, A. P.; Dias, José G.
  In this paper we present a model-based approach to the clustering of interval data, building on recently proposed parametric models. These methods consider configurations for the variance-covariance matrix that take the nature of the interval data directly into account. The proposed framework relies on parametrizations that consider the inherent variability of the relevant data units and the relation that may exist between this variability and the corresponding value levels. Using both synthetic and real data sets, the pertinence of the proposed methodology is shown, as the method effectively selects heteroscedastic models with restricted covariance structures when they are the most suitable, even in situations with limited information. Moreover, considering special configurations of the variance-covariance matrix, adapted to the nature of interval data, proves to be the adequate approach. The study also makes clear the need to consider both the information about position (conveyed by the MidPoints) and the intrinsic variability (conveyed by the Log-Ranges) when analysing interval data.
- MAINT.Data: modelling and analysing interval data in R
  Publication. Silva, Pedro Duarte; Brito, Paula; Filzmoser, Peter; Dias, José
  We present the CRAN R package MAINT.Data for the modelling and analysis of multivariate interval data, i.e., data where units are described by variables whose values are intervals of ℝ, representing intrinsic variability. Parametric inference methodologies based on probabilistic models for interval variables have been developed, where each interval is represented by its midpoint and log-range, for which multivariate Normal and Skew-Normal distributions are assumed. The intrinsic nature of the interval variables leads to special structures of the variance-covariance matrix, which are represented by four different possible configurations. MAINT.Data implements the proposed methodologies in the S4 object system, introducing a specific data class for representing interval data. It includes functions and methods for modelling and analysing interval data, in particular maximum likelihood estimation, statistical tests for the different configurations, (M)ANOVA and Discriminant Analysis. For the Gaussian model, Model-based Clustering, robust estimation, outlier detection and Robust Discriminant Analysis are also available.
- Modelling interval data with Normal and Skew-Normal distributions
  Publication. Brito, Paula; Silva, A. Pedro Duarte
  A parametric modelling for interval data is proposed, assuming a multivariate Normal or Skew-Normal distribution for the midpoints and log-ranges of the interval variables. The intrinsic nature of the interval variables leads to special structures of the variance–covariance matrix, which is represented by five different possible configurations. Maximum likelihood estimation for both models under all considered configurations is studied. The proposed modelling is then considered in the context of analysis of variance and multivariate analysis of variance testing. To assess the behaviour of the proposed methodology, a simulation study is performed. The results show that, for medium or large sample sizes, tests have good power and their true significance level approaches nominal levels when the constraints assumed for the model are respected; however, for small samples, sizes close to nominal levels cannot be guaranteed. Applications to Chinese meteorological data in three different regions and to credit card usage variables for different card designations illustrate the proposed methodology.
- Multivariate Parametric Analysis of Interval Data
  Publication. Brito, Paula; Duarte Silva, A. P.; Dias, José G.
  This work focuses on the study of interval data, i.e., data where the variables' values are intervals of ℝ, using previously proposed parametric probabilistic models. These models are based on the representation of each observed interval by its MidPoint and LogRange, for which multivariate Normal and Skew-Normal distributions are assumed, considering different structures of the variance-covariance matrix. The proposed modelling has been applied to different multivariate methodologies, namely (M)ANOVA, discriminant analysis and model-based clustering, which are presented and discussed. The R package MAINT.Data, available on CRAN, implements models and methods for the Gaussian case.
- New skills in symbolic data analysis for official statistics
  Publication. Verde, Rosanna; Batagelj, Vladimir; Brito, Paula; Silva, A. Pedro Duarte; Korenjak-Černe, Simona; Dobša, Jasminka; Diday, Edwin
  The paper draws attention to the use of Symbolic Data Analysis (SDA) in the field of Official Statistics. It is composed of three sections presenting three pilot techniques in the field of SDA: a technique based on the notion of exactly unified summaries for the creation of symbolic objects; a model-based approach for interval data as an innovative parametric strategy in this context; and measures of similarity defined between a class and a collection of classes, based on the frequency of the categories which characterize them. The paper shows the effectiveness of the proposed approaches as prototypes of numerous techniques developed within the SDA framework and opens the way to possible further developments.
- Outlier detection in interval data
  Publication. Silva, A. Pedro Duarte; Filzmoser, Peter; Brito, Paula
  A multivariate outlier detection method for interval data is proposed that makes use of a parametric approach to model the interval data. The trimmed maximum likelihood principle is adapted in order to robustly estimate the model parameters. A simulation study demonstrates the usefulness of the robust estimates for outlier detection, and new diagnostic plots allow gaining deeper insight into the structure of real world interval data.
- Parametric models for distributional data
  Publication. Brito, Paula; Silva, A. Pedro Duarte
  We present parametric probabilistic models for numerical distributional variables. The proposed models are based on the representation of each distribution by a location measure and inter-quantile ranges, for given quantiles, thereby characterizing the underlying empirical distributions in a flexible way. Multivariate Normal distributions are assumed for the whole set of indicators, considering alternative structures of the variance–covariance matrix. For all cases, maximum likelihood estimators of the corresponding parameters are derived. This modelling allows for hypothesis testing and multivariate parametric analysis. The proposed framework is applied to Analysis of Variance and parametric Discriminant Analysis of distributional data. A simulation study examines the performance of the proposed models in classification problems under different data conditions. Applications to Internet traffic data and Portuguese official data illustrate the relevance of the proposed approach.
- Probabilistic clustering of interval data
  Publication. Brito, Paula; Silva, A. Pedro Duarte; Dias, José G.
  In this paper we address the problem of clustering interval data, adopting a model-based approach. For this purpose, parametric models for interval-valued variables are used which consider configurations for the variance-covariance matrix that take the nature of the interval data directly into account. Results on both synthetic and empirical data clearly show that the proposed approach is well founded. The method succeeds in finding parsimonious heteroscedastic models, which is a critical feature in many applications. Furthermore, the analysis of the different data sets made clear the need to explicitly consider the intrinsic variability present in interval data.
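
Several of the abstracts above describe representing each interval by its MidPoint and Log-Range and then assuming a multivariate Normal distribution for these quantities. The following is a minimal Python sketch of that representation, followed by the standard unrestricted Gaussian maximum likelihood estimates (sample mean, and covariance with denominator n). It is not the MAINT.Data API: the function names and the example intervals are hypothetical, chosen only for illustration, and none of the restricted variance-covariance configurations mentioned in the abstracts is implemented.

```python
import math


def to_mid_logrange(lower, upper):
    """Map an interval [lower, upper] to (midpoint, log-range).

    Hypothetical helper; the log-range is only defined when upper > lower.
    """
    if upper <= lower:
        raise ValueError("a non-degenerate interval needs upper > lower")
    midpoint = (lower + upper) / 2.0
    log_range = math.log(upper - lower)
    return midpoint, log_range


def from_mid_logrange(midpoint, log_range):
    """Invert the mapping: recover (lower, upper) from (midpoint, log-range)."""
    half_range = math.exp(log_range) / 2.0
    return midpoint - half_range, midpoint + half_range


def gaussian_mle(points):
    """Unrestricted Gaussian ML estimates: mean vector and covariance (1/n)."""
    n = len(points)
    dim = len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(dim)]
    cov = [
        [
            sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in points) / n
            for j in range(dim)
        ]
        for i in range(dim)
    ]
    return mean, cov


# Made-up intervals for a single interval-valued variable (illustration only).
intervals = [(10.0, 18.0), (12.0, 22.0), (9.0, 15.0)]
encoded = [to_mid_logrange(lo, hi) for lo, hi in intervals]
mean, cov = gaussian_mle(encoded)
print("mean (midpoint, log-range):", mean)
print("covariance matrix:", cov)
```

Working with the logarithm of the range keeps the second coordinate unbounded, so a Normal model does not assign probability to negative ranges. The off-diagonal entry of the estimated covariance matrix captures whether MidPoints and Log-Ranges are correlated, which is the kind of relation the restricted configurations in the abstracts allow or rule out.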
