Markup | Genome Research

Table 3A.

Document Classification Performance of Different Supervised Machine Learning Algorithms

Maximum entropy
No. of words/code	10	50	100	250	500	750	1000	2000	4000
Iteration	83	109	186	104	169	104	199	65	69
Accuracy	68.62	72.73	72.8	72.56	72.83	71.54	71.44	69.47	67.66
Naïve Bayes
No. of words	100	500	1000	5000	All
Accuracy	63.89	66.92	66.88	65.59	63.79
Nearest neighbor
	No. of words
Neighbors	100	500	1000	5000	All
1	58.04	54.06	52.84	53.28	52.19
5	60.52	57.53	57.84	58.38	56.82
20	59.71	59.91	60.8	61.88	61.24
50	59.23	60.39	61.85	62.9	62.26
100	58.76	60.29	61.41	62.77	61.54
200	56.65	59.16	60.08	61.31	60.05

[i] Document classification performance of three algorithms on the Test 2000 dataset across a series of parameter settings. For maximum entropy classification, we varied the number of word features per code and measured accuracy at each iteration of the GIS optimization algorithm; each column reports the number of words/code used, the highest accuracy obtained, and the first iteration at which that accuracy was reached. For naïve Bayes classification, we calculated accuracy with vocabularies of different sizes; each column reports the vocabulary size and the corresponding accuracy. For nearest-neighbor classification, we calculated accuracy for different numbers of neighbors and different vocabularies; the accuracies are reported in a grid, with one row per number of neighbors and one column per vocabulary. The best performance achieved for each method is underlined in the original table: 72.83 for maximum entropy, 66.92 for naïve Bayes, and 62.9 for nearest neighbor.
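
The reporting rules described in this note can be sketched in a few lines of Python. This is only an illustration and not the authors' code: the function names `best_accuracy_and_first_iteration` and `best_grid_cell` are hypothetical, and the per-iteration accuracies in `toy_run` are invented, since the table reports only the final summary for each maximum entropy setting. The nearest-neighbor grid is copied from the table.

```python
from typing import Sequence


def best_accuracy_and_first_iteration(accuracies: Sequence[float]) -> tuple[float, int]:
    """Return (highest accuracy, 1-based index of the first iteration reaching it)."""
    if not accuracies:
        raise ValueError("need at least one iteration")
    best = max(accuracies)
    # index() returns the earliest position, i.e. the first iteration at the maximum.
    return best, list(accuracies).index(best) + 1


def best_grid_cell(
    grid: dict[int, Sequence[float]], columns: Sequence[str]
) -> tuple[int, str, float]:
    """Return (row key, column label, value) of the largest entry in a grid."""
    cells = (
        (row, col, value)
        for row, values in grid.items()
        for col, value in zip(columns, values)
    )
    return max(cells, key=lambda cell: cell[2])


# Invented per-iteration accuracies, for illustration only (not from the paper).
toy_run = [60.0, 71.5, 72.8, 72.8, 72.1]
print(best_accuracy_and_first_iteration(toy_run))  # (72.8, 3)

# Nearest-neighbor accuracies from the table: rows are numbers of neighbors,
# columns are vocabulary sizes.
vocabularies = ["100", "500", "1000", "5000", "All"]
knn_accuracy = {
    1: [58.04, 54.06, 52.84, 53.28, 52.19],
    5: [60.52, 57.53, 57.84, 58.38, 56.82],
    20: [59.71, 59.91, 60.8, 61.88, 61.24],
    50: [59.23, 60.39, 61.85, 62.9, 62.26],
    100: [58.76, 60.29, 61.41, 62.77, 61.54],
    200: [56.65, 59.16, 60.08, 61.31, 60.05],
}
print(best_grid_cell(knn_accuracy, vocabularies))  # (50, '5000', 62.9)
```

Applied to the nearest-neighbor grid, the lookup picks out 50 neighbors with the 5000-word vocabulary, which is the largest value in that part of the table.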