
Dialnet


Abstract of Critical values for 33 discordancy test variants for outliers in normal samples of very large sizes from 1,000 to 30,000 and evaluation of different regression models for the interpolation and extrapolation of critical values

Surendra Pal Verma Jaiswal, Alfredo Quiroz Ruiz

  • Spanish

    In this final paper of a series of four, using our well-established simulation procedure, we report new, precise, and accurate critical values or percentage points (with four to eight decimal places) of 15 discordancy tests with 33 variants, each with seven significance levels α = 0.30, 0.20, 0.10, 0.05, 0.02, 0.01, and 0.005, for normal samples of very large sizes n from 1,000 to 30,000, viz., 1,000(50)1,500(100)2,000(500)5,000(1,000)10,000(10,000)30,000, that is, 1,000 (steps of 50) 1,500 (steps of 100) 2,000 (steps of 500) 5,000 (steps of 1,000) 10,000 (steps of 10,000) 30,000. The standard error of the mean is also reported explicitly and individually for each critical value. As a consequence, the applicability of these discordancy tests has been extended to practically any statistical sample size (up to 30,000 observations or even greater). This final set of critical values for very large sizes will cover any present or future need for the application of these discordancy tests in all fields of science and engineering. Because the critical values were simulated for only a few sample sizes between 1,000 and 30,000, six different regression models were evaluated for the interpolation and extrapolation of the data, and a combined natural logarithm-cubic model was shown to be the most appropriate.

    This is the first time in the world literature that a logarithmic transformation of the sample size n before a polynomial fit is shown to perform better than the conventional fits, from linear to third-degree polynomial, used to date. Finally, we use 1,402 datasets from quantitative proteomics to show that our multiple-test method works more efficiently than the robust MAD_Z method used for processing these data and, in this way, to illustrate the usefulness of our final work along these lines.

    Revista Mexicana de Ciencias Geológicas, v. 25, núm. 3, 2008, p. 369-381

    INTRODUCTION

    Three recent papers (Verma and Quiroz-Ruiz, 2006a, 2006b; Verma et al., 2008a) have reported a highly precise and accurate Monte Carlo type simulation procedure for N(0,1) random normal variates and presented new, precise, and accurate critical values for 7 significance levels α = 0.30, 0.20, 0.10, 0.05, 0.02, 0.01, and 0.005, and for sample sizes n up to 1,000 for 15 discordancy tests with 33 variants. These tests were summarized by Verma et al. (2008a) and therefore will not be repeated here. For greater n (>1,000), practically no critical values are available in the literature for any of these tests (Barnett and Lewis, 1994; Verma, 2005).

    It may be pointed out that the critical values simulated by Verma and Quiroz-Ruiz (2006a, 2006b) and by Verma et al. (2008a) are for testing the discordancy of outliers in normal statistical samples under the assumption of some kind of contamination model (see Barnett and Lewis, 1994 and Verma et al., 2008b for details on the possible contamination models). The outliers are simply extreme observations, irrespective of their discordancy, for example, an upper outlier x(n) or a lower outlier x(1) in an ordered sample array of n observations or data x(1), x(2), x(3), ..., x(n-2), x(n-1), x(n). In an “uncontaminated” normal sample these outlying observations will ideally not be discordant, whereas in a “contaminated” normal sample they are likely to be identified as discordant. A “statistical” sample (without any assumption for the population from which it was drawn) can actually come from any distribution, such as a beta or a gamma distribution, and not necessarily from a normal distribution. Only under the assumption that a statistical sample was drawn from a normal population and was probably contaminated in some way should the outliers in this sample be tested using the discordancy tests that have been especially designed for normal samples (Barnett and Lewis, 1994). As an example, a statistical sample of experimental data (such as geochemical data) is most likely drawn from a normal population (Verma, 2005), in which the outliers can be tested as discordant (or not discordant) using the discordancy tests along with the critical values for normal samples (e.g., Verma and Quiroz-Ruiz, 2006a, 2006b; Verma et al., 2008a). For a statistical sample drawn from a different distribution, the discordancy tests especially designed for that particular distribution, along with the corresponding critical values, if available, will have to be used. Thus, the critical values for 33 discordancy test variants have been simulated for outliers in normal samples, with the possibility of their application for discordancy of outliers in statistical samples assumed to be drawn from a normal population.

    In inter-laboratory analytical studies for quality control purposes, the number of individual data (n) for a given chemical element in a reference material (RM) seldom exceeds 1,000, but this might become more common in the future. In those cases, the multiple-test method (see Verma et al., 2008a and references therein) cannot at present be applied appropriately because precise critical values for n >1,000 are not available for these discordancy tests. New critical values could therefore be proposed for n >1,000 through an adequate statistical methodology.

    Requirements of critical values for very large n (>1,000) also exist in an altogether different field, molecular and cellular proteomics (Murray Hackett, written communication, June 2007 and February 2008).

    For the present work, we have included most discordancy tests for normal univariate samples (15 tests with 33 test variants; see Table 1 in Verma et al., 2008a) for simulating new, precise, and accurate critical values for the same 7 significance levels (α = 0.30 to 0.005) and for very large sample sizes n, viz., 1,000(50)1,500(100)2,000(500)5,000(1,000)10,000(10,000)30,000, using a highly precise and accurate simulation procedure described earlier (Verma and Quiroz-Ruiz, 2006a, 2006b; Verma et al., 2008a). The above is a rather standard nomenclature to express the availability of critical values (see, for example, Barnett and Lewis, 1994) and has been used by us in the past (e.g., Verma and Quiroz-Ruiz, 2006a, 2006b). As an example, the expression “1,000(50)1,500” actually means 1,000 (steps of 50) 1,500.
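As a rough illustration of how a Monte Carlo simulation with N(0,1) variates can yield a critical value together with a standard error of the mean, the following Python sketch may help. It is not the authors' procedure: the Grubbs-type statistic (largest observation minus sample mean, divided by the sample standard deviation) stands in for a single upper-outlier test, and the function names, the number of replications, the number of independent runs, and the seed are all assumptions made for this example.

```python
import numpy as np


def grubbs_upper_statistic(samples):
    """Return (max - mean) / s for each row of a 2-D array of samples."""
    spread = samples.std(axis=1, ddof=1)
    return (samples.max(axis=1) - samples.mean(axis=1)) / spread


def simulate_critical_value(n, alpha, replications=2000, runs=5, seed=0):
    """Estimate an upper critical value and its standard error of the mean.

    Each run draws `replications` normal samples of size n and takes the
    (1 - alpha) quantile of the test statistic; the runs are then averaged.
    All default values are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(runs):
        samples = rng.standard_normal((replications, n))
        stats = grubbs_upper_statistic(samples)
        estimates.append(np.quantile(stats, 1.0 - alpha))
    estimates = np.asarray(estimates)
    mean = float(estimates.mean())
    sem = float(estimates.std(ddof=1) / np.sqrt(runs))
    return mean, sem


if __name__ == "__main__":
    cv, sem = simulate_critical_value(n=1000, alpha=0.05)
    print(f"n=1000, alpha=0.05: critical value ~ {cv:.3f} (SEM {sem:.3f})")
```

With so few replications the estimate is only good to about one or two decimal places, so it is not comparable with critical values reported to four to eight decimal places.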

  • English

    In this final paper of a series of four, using our well-tested simulation procedure we report new, precise, and accurate critical values or percentage points (with four to eight decimal places) of 15 discordancy tests with 33 test variants, and each with seven significance levels α = 0.30, 0.20, 0.10, 0.05, 0.02, 0.01, and 0.005, for normal samples of very large sizes n from 1,000 to 30,000, viz., 1,000(50)1,500(100)2,000(500)5,000(1,000)10,000(10,000)30,000, i.e., 1,000 (steps of 50) 1,500 (steps of 100) 2,000 (steps of 500) 5,000 (steps of 1,000) 10,000 (steps of 10,000) 30,000. The standard error of the mean is also reported explicitly and individually for each critical value. As a result, the applicability of these discordancy tests is now extended to practically all sample sizes (up to 30,000 observations or even greater). This final set of critical values for very large sample sizes would cover any present or future needs for the application of these discordancy tests in all fields of science and engineering.
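The compact notation 1,000(50)1,500(100)2,000(500)5,000(1,000)10,000(10,000)30,000 can be expanded mechanically. The short Python sketch below does so; the function name and the split into separate `breakpoints` and `steps` lists are assumptions chosen for this illustration.

```python
def expand_sample_sizes(breakpoints, steps):
    """Expand a 'start(step)stop(step)stop...' grid into a list of sizes.

    breakpoints: the values outside the parentheses, in increasing order.
    steps: the values inside the parentheses, one per interval.
    """
    sizes = [breakpoints[0]]
    for start, stop, step in zip(breakpoints, breakpoints[1:], steps):
        # Each interval contributes start+step, start+2*step, ..., stop.
        sizes.extend(range(start + step, stop + 1, step))
    return sizes


if __name__ == "__main__":
    breakpoints = [1000, 1500, 2000, 5000, 10000, 30000]
    steps = [50, 100, 500, 1000, 10000]
    sizes = expand_sample_sizes(breakpoints, steps)
    print(len(sizes))   # 29 sample sizes
    print(sizes[:4])    # [1000, 1050, 1100, 1150]
    print(sizes[-3:])   # [10000, 20000, 30000]
```

Only 29 sample sizes are covered between 1,000 and 30,000, so any other n (for example 1,750) calls for interpolation.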

    Because the critical values were simulated for only a few sample sizes between 1,000 and 30,000, six different regression models were evaluated for interpolation and extrapolation, and a combined natural logarithm-cubic model was shown to be the most appropriate. This is the first time in the literature that a log-transformation of the sample size n before a polynomial fit is shown to perform better than the conventional linear to polynomial regressions hitherto used. We also use 1,402 unpublished datasets from quantitative proteomics to show that our multiple-test method works more efficiently than the MAD_Z robust outlier method used for processing these data, and thus to illustrate the usefulness of our final work along these lines.
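The idea of transforming n to ln(n) before a polynomial fit can be sketched in Python as follows. The abstract does not give the exact form of the natural logarithm-cubic model, so a cubic polynomial in ln(n) is assumed here. The "critical values" are a synthetic placeholder curve, sqrt(2 ln n), chosen only because it grows slowly with n; they are not values from the paper, and all function names are hypothetical.

```python
import numpy as np
from numpy.polynomial import Polynomial


def sample_sizes():
    """The 29 simulated sample sizes between 1,000 and 30,000."""
    parts = [
        np.arange(1000, 1500, 50),
        np.arange(1500, 2000, 100),
        np.arange(2000, 5000, 500),
        np.arange(5000, 10000, 1000),
        np.arange(10000, 30001, 10000),
    ]
    return np.concatenate(parts).astype(float)


def synthetic_critical_value(n):
    """Placeholder curve standing in for tabulated values (NOT paper data)."""
    return np.sqrt(2.0 * np.log(n))


def rmse(model, x, y):
    """Root-mean-square residual of a fitted model."""
    return float(np.sqrt(np.mean((model(x) - y) ** 2)))


def compare_fits():
    """Fit a cubic in n and a cubic in ln(n); return both models and errors."""
    n = sample_sizes()
    cv = synthetic_critical_value(n)
    cubic_in_n = Polynomial.fit(n, cv, deg=3)
    cubic_in_ln_n = Polynomial.fit(np.log(n), cv, deg=3)
    return {
        "cubic_in_n": cubic_in_n,
        "cubic_in_ln_n": cubic_in_ln_n,
        "rmse_n": rmse(cubic_in_n, n, cv),
        "rmse_ln_n": rmse(cubic_in_ln_n, np.log(n), cv),
    }


if __name__ == "__main__":
    result = compare_fits()
    print("RMSE, cubic in n:    ", result["rmse_n"])
    print("RMSE, cubic in ln(n):", result["rmse_ln_n"])
    # Interpolate at a sample size that is not on the simulated grid.
    n_new = 1750.0
    print("Interpolated value at n = 1750:",
          float(result["cubic_in_ln_n"](np.log(n_new))))
```

On this placeholder curve the fit in ln(n) leaves much smaller residuals than the cubic in n, which mirrors the qualitative claim above. It says nothing about the other regression models the paper evaluated, which the abstract does not list.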

