Table 3A.

Document Classification Performance of Different Supervised Machine Learning Algorithms

Maximum entropy
No. of words/code   10      50      100     250     500     750     1000    2000    4000
Iteration           83      109     186     104     169     104     199     65      69
Accuracy            68.62   72.73   72.8    72.56   72.83*  71.54   71.44   69.47   67.66

Naïve Bayes
No. of words        100     500     1000    5000    All
Accuracy            63.89   66.92*  66.88   65.59   63.79

Nearest neighbor
                    No. of words
Neighbors           100     500     1000    5000    All
1                   58.04   54.06   52.84   53.28   52.19
5                   60.52   57.53   57.84   58.38   56.82
20                  59.71   59.91   60.8    61.88   61.24
50                  59.23   60.39   61.85   62.9*   62.26
100                 58.76   60.29   61.41   62.77   61.54
200                 56.65   59.16   60.08   61.31   60.05

* Best accuracy for the method (underlined in the original table).

[i] Document classification performance of three algorithms on the Test 2000 dataset across a series of parameters. For maximum entropy classification, we varied the number of word features per code and measured accuracy at each iteration of the GIS optimization algorithm. Each column reports the number of words/code used, the highest accuracy obtained, and the first iteration at which that accuracy was reached. For naïve Bayes classification, we calculated accuracy on different vocabularies; each column reports the vocabulary size and the accuracy. For nearest-neighbor classification, we calculated accuracy for different numbers of neighbors and different vocabularies. The accuracies are reported in a grid, with one row per number of neighbors and one column per vocabulary. The best performance achieved for each method is underlined.
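
The selection rules described in this note can be sketched in a few lines of Python. This is only an illustration, not the code used for the experiments: the function names `best_cell` and `first_best_iteration` are hypothetical, the 1-based iteration numbering is an assumption, and the short accuracy trace in the last line is invented for the example. The nearest-neighbor grid is copied from the table.

```python
# Nearest-neighbor accuracies copied from Table 3A.
# Keys are numbers of neighbors; list positions follow VOCAB_SIZES.
VOCAB_SIZES = ["100", "500", "1000", "5000", "All"]
KNN_ACCURACY = {
    1:   [58.04, 54.06, 52.84, 53.28, 52.19],
    5:   [60.52, 57.53, 57.84, 58.38, 56.82],
    20:  [59.71, 59.91, 60.80, 61.88, 61.24],
    50:  [59.23, 60.39, 61.85, 62.90, 62.26],
    100: [58.76, 60.29, 61.41, 62.77, 61.54],
    200: [56.65, 59.16, 60.08, 61.31, 60.05],
}


def best_cell(grid, columns):
    """Return (row_key, column_label, value) for the largest value in the grid."""
    cells = (
        (row_key, label, value)
        for row_key, row in grid.items()
        for label, value in zip(columns, row)
    )
    return max(cells, key=lambda cell: cell[2])


def first_best_iteration(accuracies):
    """Return (iteration, accuracy) for the first occurrence of the maximum.

    Iterations are assumed to be numbered from 1.
    """
    best = max(accuracies)
    return accuracies.index(best) + 1, best


print(best_cell(KNN_ACCURACY, VOCAB_SIZES))
# Hypothetical per-iteration accuracy trace, for illustration only.
print(first_best_iteration([70.0, 72.5, 72.5, 71.0]))
```

Applied to the nearest-neighbor grid, `best_cell` picks out the cell for 50 neighbors with the 5000-word vocabulary, which is the best nearest-neighbor result in the table.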