


Machine Learning Methods to Identify Mislabeled Training Data and Appropriate Features for Global Land Cover Classification

Jonathan Cheung-Wai Chan, Matthew C. Hansen and Ruth S. DeFries 1
Laboratory for Global Remote Sensing Studies,
University of Maryland, College Park, MD 20742, USA

1 Also with Earth System Science Interdisciplinary Center,
University of Maryland, College Park, MD 20742, USA
Tel: +1 301 405 8696, Fax: +1 301 314 9299
Email: jonchan@glue.umd.edu

Key Words
machine learning, filtering, mislabeled data, feature selection

Abstract
In an effort to improve the AVHRR 1km global land cover product (Hansen et al., 2000), the original training data set used to derive the supervised decision tree classifier is evaluated using machine learning methods with two aims: to identify mislabeled pixels and to optimize the input feature subset. A filtering process is formed by three data modelers: Decision Tree Classifier, Instance Based Learning and Learning Vector Quantization. Cross-validation is used to mark every case in the data set as correctly or incorrectly classified, and incorrectly classified instances are treated as possible mislabels. Consensus filtering is performed and the results show that 74% of the cases identified as mislabeled agree with expert knowledge. More mislabels are identified if cases flagged by only one or two of the data modelers are also considered, but agreement with expert knowledge degrades to 54.2-65.5%. A wrapper approach to feature subset selection is then applied to the improved data set, with the mislabeled training data discarded. A total of 18 features were chosen from the original 41. A map generated using the improved training set was compared with an expert-improved classification. Overall agreement is 53.1%, and by-class agreement ranges from less than 25% to more than 70%. Future efforts are needed to address what leads to the disagreements and how machine learning methods can be better tapped for operational global land cover mapping.

Introduction
Global land cover information is a prerequisite input to many atmospheric models that describe Earth system processes (Sellers et al., 1997). In the last two decades, remotely sensed data have been increasingly utilized to monitor global land cover change on a regular basis. Previous studies have shown success in using AVHRR data to produce global vegetation maps at 1km and 8km resolutions (Hansen et al., 2000; DeFries et al., 1998). For the global land cover map at 1km resolution, only training sites interpreted from fine-resolution Landsat Multispectral Scanner System (MSS) and Thematic Mapper (TM) data as containing 100% of the land cover of interest were chosen. Details of the preparation of the training data set are documented in Hansen et al. (2000). Collecting a global training data set is understandably difficult considering the diversity of each land cover type: exhaustive in-situ validation is either too expensive or the sites are simply physically inaccessible. Given these constraints, mislabels in the training data set are not uncommon. Constant efforts have been made to improve the 1km product. This paper reviews the use of machine learning techniques to identify mislabels and to optimize the feature subset in the training set.

Methodology

Filtering mislabels in training

Weisberg (1985) suggested the use of regression analysis to identify outliers in training data: cases that cannot be described by the model and that have the largest residual errors are outliers. Motivated by the same idea, Brodley and Friedl (1999) used filtering to clean mislabeled cases out of training data. The idea is to use different learning algorithms to generate various classifiers from the training set; these classifiers form a committee of data modelers. A case that does not conform to any of the models may be either noise or an exception. Since there is no way to distinguish noise from exceptions in the training set, such outliers are treated as candidate mislabels. Brodley and Friedl (1999) showed that filtering improves classification accuracy significantly for noise levels of up to 30%. Whether a consensus filter (discarding cases misclassified by every model) or a majority filter (discarding cases misclassified by most, but not necessarily all, of the models) should be used depends on the objectives of the task; both schemes are sketched below. In this study, two sets of training labels were used, the second being an improved version with additional expert input. Mislabeled cases identified in the first training set by machine learning filtering are compared against the improved data set.
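
Both filtering schemes reduce to a vote count over the committee. The following minimal Python sketch assumes each data modeler has already produced a boolean mask of its cross-validated misclassifications (the marking procedure is described next); the array names and toy data are illustrative only, not from the study:

import numpy as np

def filter_mislabels(misclassified, scheme="consensus"):
    """Flag candidate mislabels from per-model misclassification masks.

    misclassified: (n_models, n_cases) boolean array, True where a model's
    cross-validated prediction disagrees with the training label.
    scheme: "consensus" flags cases misclassified by every model;
    "majority" flags cases misclassified by more than half of the models.
    """
    votes = misclassified.sum(axis=0)        # mislabel votes per case
    n_models = misclassified.shape[0]
    if scheme == "consensus":
        return votes == n_models             # all modelers must agree
    return votes > n_models / 2              # strict majority

# Toy example with three modelers (DTC, IB, LVQ) and five cases:
masks = np.array([[True, True,  False, True,  False],   # DTC errors
                  [True, False, False, True,  False],   # IB errors
                  [True, True,  False, False, False]])  # LVQ errors
print(filter_mislabels(masks, "consensus"))  # [ True False False False False]
print(filter_mislabels(masks, "majority"))   # [ True  True False  True False]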

To mark each instance as correctly or incorrectly labeled, an n-fold cross-validation is performed over the training data. The training data are subdivided into n equal parts and, for each of the n parts, a classifier is generated from the other n-1 parts. After the n trials, the whole training set has been marked. We used 10-fold cross-validation for our experiments. The committee of data modelers comprised a Decision Tree Classifier, Instance Based Learning and Learning Vector Quantization.
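
A sketch of this marking step is shown below. It uses scikit-learn's KFold and decision tree purely as stand-ins for the actual inducers (C5.0, IB and LVQ), and the synthetic data merely mimics the shape of the real set:

import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

def cross_validated_errors(X, y, make_model, n_folds=10, seed=0):
    """Mark every training case as correctly or incorrectly classified.

    Each case is predicted exactly once, by a model trained on the other
    n_folds - 1 parts; True marks disagreement with the given label.
    """
    errors = np.zeros(len(y), dtype=bool)
    folds = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for train_idx, test_idx in folds.split(X):
        model = make_model().fit(X[train_idx], y[train_idx])
        errors[test_idx] = model.predict(X[test_idx]) != y[test_idx]
    return errors

# Synthetic stand-in data (the real set has 37340 cases and 41 metrics).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 41))
y = rng.integers(0, 12, size=500)
# One mask per committee member; stack them for the voting filter above.
dtc_errors = cross_validated_errors(X, y, DecisionTreeClassifier)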

Decision Tree Classifiers (DTC) recursively subdivide the training set into homogeneous subsets according to certain split rules. They are fast learners, and the explicit tree structure aids interpretation of the classification process. Our experiment used C5.0, the commercial successor of C4.5 developed by Quinlan (1993). The split rule of C5.0 is based on information gain; details of the information gain criterion can be found in Quinlan (1993).
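
For illustration, the information gain of a binary split can be computed in a few lines; this is a simplified sketch with function names of our own choosing (C5.0 itself uses multi-way splits and gain-ratio corrections):

import numpy as np

def entropy(labels):
    """Shannon entropy H(S) = -sum(p_i * log2(p_i)) of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(labels, left_mask):
    """Gain of a binary split: H(S) minus the size-weighted child entropies."""
    n = len(labels)
    left, right = labels[left_mask], labels[~left_mask]
    return entropy(labels) - (len(left) / n * entropy(left) +
                              len(right) / n * entropy(right))

# A perfect split of six cases recovers the full 1.0 bit of gain:
y = np.array([0, 0, 0, 1, 1, 1])
x = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9])
print(information_gain(y, x < 0.5))  # 1.0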

Instance Based Learning (IB) is described as a lazy learner because it does not process the inputs until information is requested (Aha, 1998). As such, it has little training cost. IB is slow when there are many attributes, and since it answers each query from retained instances of similar problems, high classification costs are expected. More about IB can be found in Aha (1992) and Wettschereck (1994). We used the Instance Based inducer of the SGI MLC++ (Machine Learning library in C++, 1996) Utilities 2.0 for our experiments, with the number of nearest neighbors set to 1.
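
The lazy character of a one-nearest-neighbor learner is easy to see in code: nothing is done at training time beyond storing the instances, and all cost is paid at prediction. The sketch below is our own illustration, not the MLC++ implementation, and assumes Euclidean distance:

import numpy as np

def ib1_predict(X_stored, y_stored, X_query):
    """IB1-style prediction: no model is built; each query takes the label
    of its single nearest stored instance (Euclidean distance)."""
    preds = []
    for q in X_query:
        dists = np.linalg.norm(X_stored - q, axis=1)  # distance to every stored case
        preds.append(y_stored[np.argmin(dists)])      # nearest neighbor's label
    return np.array(preds)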

Learning Vector Quantization (LVQ) is a supervised version of Self-Organizing Maps (Kohonen, 1995). By assigning codebook vectors to each class, LVQ defines class borders resembling the Voronoi sets of vector quantization; classification is then by the nearest codebook vector. Experiments were performed with the LVQ_PAK v3.1 package available from the Helsinki University of Technology. We assigned 100 codebook vectors to each class. The first 5,000 training steps were run with OLVQ1, which self-adjusts the learning rate. Training then continued for 200,000 steps with LVQ1 at a lower learning rate of 0.05. A snapshot was taken every 10,000 steps, and the step with the highest accuracy rate was chosen for the trial.
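
The core of LVQ1 is a single update rule: the winning codebook vector is pulled toward a training sample when their classes match and pushed away otherwise. A minimal sketch, with variable names of our choosing rather than LVQ_PAK's, is:

import numpy as np

def lvq1_step(codebooks, codebook_labels, x, y, alpha=0.05):
    """One LVQ1 update: pull the winning codebook vector toward sample x
    if its class matches label y, otherwise push it away."""
    c = np.argmin(np.linalg.norm(codebooks - x, axis=1))  # winning vector
    direction = 1.0 if codebook_labels[c] == y else -1.0
    codebooks[c] += direction * alpha * (x - codebooks[c])
    return codebooks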

Optimal feature subset
Finding an optimal feature subset for a task can improve interpretation and reduce computing costs. Reducing computing costs is particularly important in the context of global land cover mapping, considering the size of the data sets being handled. Irrelevant or correlated features are reportedly damaging to learning algorithms such as decision trees (John et al., 1994). Applying feature selection reduces the input dimension, with the possibility of enhanced accuracy.

Both filter and wrapper approaches can be adopted for feature subset selection. Filter approaches use only the training data in the evaluation, whereas wrapper approaches incorporate the induction algorithm itself as part of the evaluation in the search for the best possible feature subset. Wrapper approaches are reportedly superior when applied to decision trees, yielding smaller trees (better interpretative power) and higher accuracies (Kohavi and John, 1998). For our experiment, we used a wrapper approach with the decision tree classifier: the induction algorithm is run on the training data using different subsets of the original feature set, and the subset with the highest evaluation is used to generate the final classifier.

A forward selection procedure using best-first search is adopted (Ginsberg, 1993). Forward selection implies an operation of addition at each expansion. The search states are nodes representing subsets of features, and the idea of best-first search is to jump to the most promising node generated so far that has not yet been expanded. The search stops when an improved node has not been found in the previous k expansions, where an improved node is one whose accuracy is at least x percent higher than that of the best node found so far. For our experiment, k and x are set to 5 and 0.001% respectively; the procedure is sketched below.
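
The sketch below implements this search under the stated settings (k = 5 expansions; eps = 1e-5, i.e. 0.001% in accuracy). It substitutes scikit-learn's decision tree and cross-validated accuracy for the C5.0 evaluation used in the study, and the helper names are illustrative:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def wrapper_best_first(X, y, k=5, eps=1e-5):
    """Forward best-first feature subset search with wrapper evaluation.

    Nodes are feature subsets scored by the cross-validated accuracy of
    the induction algorithm. The search stops after k expansions without
    a node beating the best score by at least eps."""
    def score(subset):
        return cross_val_score(DecisionTreeClassifier(),
                               X[:, sorted(subset)], y, cv=5).mean()

    open_nodes = {frozenset(): 0.0}   # unexpanded subsets and their scores
    seen = {frozenset()}
    best_subset, best_score, stale = frozenset(), 0.0, 0
    while open_nodes and stale < k:
        node = max(open_nodes, key=open_nodes.get)  # most promising node
        del open_nodes[node]
        stale += 1
        for f in range(X.shape[1]):                 # forward step: add one feature
            child = node | {f}
            if child in seen:
                continue
            seen.add(child)
            s = score(child)
            open_nodes[child] = s
            if s >= best_score + eps:               # an "improved" node
                best_subset, best_score, stale = child, s, 0
    return sorted(best_subset), best_score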

Data
The training data set was obtained by sampling from the AVHRR 1km global vegetation cover product (Hansen et al., 2000). A total of 37340 cases were extracted. The derivation of the 41 metrics is described in detail in Hansen et al. (2000). The training data were thoroughly shuffled to prevent any correlation between the training and testing partitions.

Discussion and Conclusions
Table 1. Number of cases identified as mislabels

                     Number of cases
Expert knowledge     11732
LVQ                  9058
DTC                  7438
IB                   5376


Table 2. Agreements between individual data modelers

        IB      DTC
LVQ     3725    4389
DTC     3200


Filtering methods to identify mislabeled data
An improved version of the training data, incorporating expert opinion, is available for comparison. With expert knowledge, 11732 cases were identified as mislabels. Table 1 shows the number of mislabels identified by each individual classifier; this number in fact represents the cross-validation errors of each learning algorithm. IB outperformed the other two classifiers in terms of accuracy and hence identified the fewest mislabels. LVQ flagged the highest number of mislabels, though still about 2700 pixels fewer than expert opinion. Table 2 shows the agreements among the filters; LVQ and DTC have the highest agreement. Since LVQ flags the most mislabels, its agreements with the other filters are also slightly higher.

Table 3. Number of mislabeled cases identified using consensus/voting filtering and the agreement with expert knowledge

Number of votes    Number of cases    Cases agreeing with expert knowledge
One Vote           13074              7080 (54.2%)
Two Votes          6280               4115 (65.5%)
Three Votes        2517               1863 (74.0%)



Table 4. List of the 41 features used for the 1km map. The 18 features shown in bold are those selected by the feature subset selection algorithm
  1. Maximum NDVI value
  2. Minimum NDVI value of 8 greenest months
  3. Mean NDVI value of 8 greenest months
  4. Amplitude of NDVI over 8 greenest months
  5. Mean NDVI value of 4 warmest months
  6. NDVI value of warmest month
  7. Maximum Channel 1 value of 8 greenest months
  8. Minimum Channel 1 value of 8 greenest months
  9. Mean Channel 1 value of 8 greenest months
  10. Amplitude of Channel 1 over 8 greenest months
  11. Channel 1 value from month of maximum NDVI
  12. Mean Channel 1 value of 4 warmest months
  13. Channel 1 value of warmest month
  14. Maximum Channel 2 value of 8 greenest months
  15. Minimum Channel 2 value of 8 greenest months
  16. Mean Channel 2 value of 8 greenest months
  17. Amplitude of Channel 2 over 8 greenest months
  18. Channel 2 value from month of maximum NDVI
  19. Mean Channel 2 value of 4 warmest months
  20. Channel 2 value of warmest month
  21. Maximum Channel 3 value of 8 greenest months
  22. Minimum Channel 3 value of 8 greenest months
  23. Mean Channel 3 value of 8 greenest months
  24. Amplitude of Channel 3 over 8 greenest months
  25. Channel 3 value from month of maximum NDVI
  26. Mean Channel 3 value of 4 warmest months
  27. Channel 3 value of warmest month
  28. Maximum Channel 4 value of 8 greenest months
  29. Minimum Channel 4 value of 8 greenest months
  30. Mean Channel 4 value of 8 greenest months
  31. Amplitude of Channel 4 over 8 greenest months
  32. Channel 4 value from month of maximum NDVI
  33. Mean Channel 4 value of 4 warmest months
  34. Channel 4 value of warmest month
  35. Maximum Channel 5 value of 8 greenest months
  36. Minimum Channel 5 value of 8 greenest months
  37. Mean Channel 5 value of 8 greenest months
  38. Amplitude of Channel 5 over 8 greenest months
  39. Channel 5 value from month of maximum NDVI
  40. Mean Channel 5 value of 4 warmest months
  41. Channel 5 value of warmest month


Table 3 shows the results of filtering by voting. If we take all mislabels flagged by any filter (One Vote), a total of 13074 cases are identified and 54.2% of those cases agree with expert knowledge. With majority vote filtering, i.e. at least two votes, 6280 cases are flagged, and a higher percentage (65.5%) agrees with expert knowledge. If consensus filtering (Three Votes) is performed, only 2517 cases are chosen, and the agreement with expert knowledge is the highest (74.0%). Our findings echo the observations of Brodley and Friedl (1999) that consensus filtering is conservative in terms of throwing away good data, while majority voting is better at detecting bad data but carries a greater risk of discarding good data. Our results show that filtering by consensus or by voting can identify mislabeled training data, with the lowest agreement with expert opinion at 54.2%. Since retaining mislabels would degrade performance, we suggest using the majority vote, as more mislabels can be identified and discarded. While we have shown that a certain portion of expert opinion can be modeled by machine learning methods, more studies are needed to address the disagreements between filtering by data modelers and expert knowledge.

Feature Subset Selection
The size of the input dimension is extremely important when dealing with a global data set, since the task involves handling millions of pixels. Our last experiment performed feature subset selection over the improved training data, in which the mislabeled cases identified by majority voting (Two Votes) were discarded. It should be noted that 7612 mislabels picked by the expert were not identified by the machine learning procedures. The feature subset selection algorithm identified a subset of 18 features from the original 41. The input dimension is thus reduced by more than 50%, which significantly reduces processing time.

Table 4 lists the optimal features chosen by the feature subset selection (FSS) algorithm. The inclusion of all NDVI metrics shows the utility of the normalized ratio in mapping global vegetative land cover. This agrees with the expert-modified tree classification, in which NDVI metrics were useful in mapping most cover types. Minimum annual red reflectance was also found to be important in mapping tree cover, and the FSS includes it as well. Temperature bands, particularly minimum channel 3, are useful in discriminating tree leaf type and leaf longevity classes. Compared to the expert-modified tree classification, an important set of metrics missing from the FSS output is the mean of the four warmest months for channels 4 and 5 (features 33 and 40 in Table 4). These metrics are useful in delineating tropical forest from woodland based on dry-season land surface temperature, but they were not selected. In general, the FSS clearly retains the most important metrics, those which best generalize the data set; however, some metrics which act regionally, as just mentioned, are not included.

Map output/agreement
The overall agreement between a map generated using the training data set improved by machine learning and the expert-modified tree classification is 53.1%. Classes with the best agreement include needleleaf evergreen and deciduous forests, crops and bare ground (all >70%). The most poorly performing classes include wooded grassland, closed shrubs and grassland (all <25%). Preserving classes with high intraclass variability, such as wooded grassland, can be problematic; this is highlighted here, as the filtered data and FSS tree outputs mapped an area of wooded grassland less than half that of the expert-modified map. Hansen et al. (2000) state that most training errors and confusion occur between classes consisting of mixtures. An example of such confusion is areas of partial tree cover, such as wooded grasslands and croplands, which often exist in mosaics with naturally occurring land cover types. Objectively finding the appropriate thresholds to best depict mixed cover types such as wooded grassland is a challenge and requires further examination.

Acknowledgements
This research was supported by NASA grants NAG56970, NAG56004, NAS596060 and NAG56364.

References
  • Aha, D.W., 1992. Tolerating noisy, irrelevant, and novel attributes in instance-based learning algorithms. International Journal of Man-Machine Studies, 36:267-287.
  • Aha, D.W., 1998. Feature weighting for lazy learning algorithms. In Feature Extraction, Construction and Selection: A Data Mining Perspective, edited by Liu, H., and Motoda, H., Kluwer Academic.
  • Brodley, C.E., and Friedl, M.A., 1999. Identifying mislabeled training data. Journal of Artificial Intelligence Research, 11, pp.131-167.
  • DeFries, R., Hansen, M.C., Townshend, J.R.G. and Sohlberg, R., 1998. Global land cover classifications at 8 km spatial resolution: The use of training data derived from Landsat imagery in decision tree classifiers. International Journal of Remote Sensing, 19 (16), pp. 3141-3168.
  • Ginsberg, M., 1993. Essentials of Artificial Intelligence. Morgan Kaufmann.
  • Hansen, M.C., DeFries, R.S., Townshend, J.R.G. and Sohlberg, R., 2000. Global land cover classifications at 1 km spatial resolution using a classification tree approach. International Journal of Remote Sensing, 21 (6&7), pp. 1331-1364.
  • John, G.H., Kohavi, R., and Pfleger, K., 1994. Irrelevant features and the subset selection problem. Machine Learning: Proceedings of the Eleventh International Conference, pp. 121-129.
  • Kohavi, R. and John, G. H., 1998. The wrapper approach. In Feature Extraction, Construction and Selection: A Data Mining Perspective, edited by Liu, H. and Motoda, H., Kluwer Academic.
  • Kohonen, T., 1995. Self-Organizing Maps. Berlin: Springer.
  • Quinlan, J.R., 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann.
  • Sellers, P.J., Dickinson, R.E., Randall, D.A., Betts, A.K., Hall, F.G., Mooney, H.A., Nobre, C.A., Sato, N., Field, C.B., and Henderson-Sellers, A., 1997. Modeling the exchanges of energy, water, and carbon between continents and the atmosphere. Science, 275, pp. 502-509.
  • Weisberg, S., 1985. Applied Linear Regression. John Wiley & Sons.
  • Wettschereck, D., 1994. A study of distance-based machine learning algorithms. Doctoral dissertation, Oregon State University, Department of Computer Science.