ISSN 0253-2778

CN 34-1054/N

Open Access JUSTC Original Paper

Outlier detection of Yangtze River basin meteorological data based on robust S-estimator

Cite this:
https://doi.org/10.3969/j.issn.0253-2778.2018.11.001
  • Received Date: 18 April 2018
  • Accepted Date: 15 June 2018
  • Revision Received Date: 15 June 2018
  • Publish Date: 30 November 2018
  • Outliers are unavoidable in high-dimensional data such as meteorological data, and the most widely used method, least squares, is neither robust to outliers nor sensitive enough to detect them. Robust estimation keeps the estimators from being strongly influenced by outliers, so that the outliers can be identified more reliably. By adding Tukey's biweight function as a constraint, a principal component analysis model based on the robust S-estimator was established; it converges rapidly and does not require assuming a specific form for the distribution function. The observations were then smoothed with a B-spline basis, the mean squared norm of the residuals was used as the test statistic, and the adjusted boxplot, which is also robust, was used to detect the outliers. In the example, more than 58,000 measurements of meteorological data covering 60 years from 5 cities in the Yangtze River basin were used. The data set was analyzed with two outlier detection procedures, one based on classical principal component analysis and one based on the robust S-estimator. Compared with the classical approach, the procedure based on the robust S-estimator gives more information on the abnormal data and thus identifies outliers better.
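The abstract names Tukey's biweight function and the S-estimator without giving formulas. As a rough illustration only, the sketch below shows the scale step that S-estimators are built on: an M-scale s that solves mean(rho(r_i / s)) = b for a set of residuals r_i. It is not the paper's implementation. The function names `tukey_biweight_rho` and `m_scale`, the tuning constant c = 1.547, and the right-hand side b = 0.5 are assumptions chosen for this example (a common textbook pairing); the paper may use different values.

```python
import statistics


def tukey_biweight_rho(u, c=1.547):
    """Tukey's biweight rho, normalised so that rho(u) = 1 whenever |u| >= c."""
    if abs(u) >= c:
        return 1.0
    t = (u / c) ** 2
    return 1.0 - (1.0 - t) ** 3


def m_scale(residuals, c=1.547, b=0.5, tol=1e-9, max_iter=200):
    """Solve mean(rho(r_i / s)) = b for s by fixed-point iteration.

    c and b are illustrative defaults (an assumption, not taken from the paper).
    """
    residuals = list(residuals)
    # Robust starting value: median absolute residual, rescaled for the normal case.
    s = statistics.median(abs(r) for r in residuals) / 0.6745
    if s == 0.0:
        return 0.0  # at least half of the residuals are exactly zero
    for _ in range(max_iter):
        mean_rho = sum(tukey_biweight_rho(r / s, c) for r in residuals) / len(residuals)
        s_new = s * (mean_rho / b) ** 0.5
        if abs(s_new - s) <= tol * s:
            return s_new
        s = s_new
    return s


# Hypothetical residuals: a clean sample, and the same sample with one gross outlier.
clean = [-1.2, -0.8, -0.5, -0.1, 0.0, 0.3, 0.4, 0.9, 1.1, 1.5]
contaminated = clean[:-1] + [50.0]

scale_clean = m_scale(clean)
scale_contaminated = m_scale(contaminated)
std_contaminated = statistics.pstdev(contaminated)
```

Because rho is bounded at 1, a single gross outlier adds at most 1/n to the average, so `scale_contaminated` stays close to `scale_clean`, whereas `std_contaminated` is inflated by the outlier. This bounded influence is the property that lets a robust fit leave outliers with large, visible residuals.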