This file describes BayseanClassifier, which classifies documents into
categories using the classic Bayesian classification algorithm. A second
program, CategoryValidator, calculates the accuracy of the results.
Copyright (C) 2016 Ezra Erb
This program is free software: you can redistribute it and/or modify it under
the terms of the GNU General Public License version 3 as published by the Free
Software Foundation.
This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with
this program. If not, see <http://www.gnu.org/licenses/>.
I'd appreciate a note if you find this program useful or make updates. Please
contact me through LinkedIn (my profile also has a link to the code repository).
The project consists of two programs, BayseanClassifier and CategoryValidator.
The former implements a classic Bayesian text classification algorithm to
classify documents based on a training set. It removes stop words and
non-alphabetic words, and then applies Porter stemming to reduce the
dimensionality of the overall word space. The classifier expects the training
documents to be organized into directories by category, giving the following
structure:
root1
    category 1a
        document
        document
        ...
    category 1b
        document
        document
        ...
    ...
root2
    ...
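As an illustration only, the layout above could be read into memory roughly as
follows. This is a Python sketch, not the project's own code; the function name
load_training_set is hypothetical, and it assumes documents sit directly inside
each category directory and can be read as text.

```python
from collections import defaultdict
from pathlib import Path


def load_training_set(roots):
    """Map each category name to the text of the documents found under it.

    Hypothetical helper: every sub-directory of a root is treated as a
    category, and every file directly inside it as one training document.
    """
    docs_by_category = defaultdict(list)
    for root in roots:
        for category_dir in Path(root).iterdir():
            if not category_dir.is_dir():
                continue
            for doc in category_dir.iterdir():
                if doc.is_file():
                    # Ignore undecodable bytes rather than failing on them.
                    text = doc.read_text(errors="ignore")
                    docs_by_category[category_dir.name].append(text)
    return dict(docs_by_category)
```

Categories that appear under more than one root are merged by directory name
in this sketch.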
Multiple directory roots may be specified. Any number of documents to classify
may be specified, including directories. For a directory, all documents in the
directory tree will be classified. Document paths and categories are written to
standard output.
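The general technique can be sketched in a few lines of Python. This is a
minimal illustration, not the project's implementation: it assumes a
multinomial naive Bayes model with add-one smoothing, uses a tiny made-up
stop-word list, and leaves out Porter stemming to stay dependency-free. The
names tokenize, train, and classify are hypothetical.

```python
import math
import string
from collections import Counter, defaultdict

# Tiny illustrative stop-word list (an assumption; the real list is not given).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}


def tokenize(text):
    """Lowercase, strip surrounding punctuation, and keep only alphabetic
    words that are not stop words. Stemming is omitted in this sketch."""
    words = (w.strip(string.punctuation) for w in text.lower().split())
    return [w for w in words if w.isalpha() and w not in STOP_WORDS]


def train(docs_by_category):
    """docs_by_category maps a category name to a list of document strings."""
    word_counts = defaultdict(Counter)  # category -> word -> occurrences
    doc_counts = Counter()              # category -> number of documents
    vocabulary = set()
    for category, docs in docs_by_category.items():
        for text in docs:
            tokens = tokenize(text)
            word_counts[category].update(tokens)
            vocabulary.update(tokens)
            doc_counts[category] += 1
    return word_counts, doc_counts, vocabulary


def classify(text, model):
    """Return the category with the highest log-posterior score."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    tokens = [t for t in tokenize(text) if t in vocabulary]
    best_category, best_score = None, -math.inf
    for category in doc_counts:
        # Log prior: fraction of training documents in this category.
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for token in tokens:
            # Add-one smoothing keeps unseen words from zeroing the score.
            count = word_counts[category][token]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category
```

Summing logarithms rather than multiplying probabilities avoids floating-point
underflow on long documents.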
CategoryValidator takes a directory tree of documents organized into directories
by category. The expected structure is the same as the training set for the
Bayesian classifier. It compares this to the classifier's results file to
calculate both the precision and recall rates per category. Precision is the
percentage of documents classified in a category that actually belong there.
Recall is the percentage of documents in a category that were classified there.
These two measures are then combined into the balanced F measure statistic per
category.
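The definitions above can be written out as a short Python sketch. The function
name per_category_scores and the dictionary-based inputs are assumptions made
for illustration. The balanced F measure is taken here as the harmonic mean of
precision and recall, and values are returned as fractions between 0 and 1
rather than percentages.

```python
from collections import Counter


def per_category_scores(expected, predicted):
    """Return {category: (precision, recall, f_measure)}.

    expected and predicted each map a document path to a category name.
    """
    actual_totals = Counter(expected.values())  # documents truly in category
    predicted_totals = Counter()                # documents classified there
    correct = Counter()                         # classified there and belong
    for doc, category in predicted.items():
        predicted_totals[category] += 1
        if expected.get(doc) == category:
            correct[category] += 1

    scores = {}
    for category in set(actual_totals) | set(predicted_totals):
        hits = correct[category]
        classified = predicted_totals[category]
        actual = actual_totals[category]
        precision = hits / classified if classified else 0.0
        recall = hits / actual if actual else 0.0
        if precision + recall:
            f_measure = 2 * precision * recall / (precision + recall)
        else:
            f_measure = 0.0
        scores[category] = (precision, recall, f_measure)
    return scores
```

For example, if three documents are classified into a category and two of them
belong there, and those two are the only documents in that category, precision
is 2/3, recall is 1, and the F measure is 0.8.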
The classifier was tested through cross validation on a classic set of Usenet
posts. They were distributed between 20 news groups with 1000 posts per group.
The classifier attempts to select the news group for each post. 75% of the
posts in each group were used for training, the remainder for classification.
The F measure values varied per news group, with closely related groups
having the lowest values. F values for different news groups ranged from 0.61
to 0.98, in line with other Bayesian classifier implementations.
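A per-group 75%/25% split like the one described above could be produced along
these lines. This Python sketch is an assumption about the procedure, not a
record of it: the source does not say how posts were chosen for each side, so
the seeded random shuffle and the name split_by_category are illustrative only.

```python
import random


def split_by_category(docs_by_category, train_fraction=0.75, seed=0):
    """Split each category's documents into training and held-out lists.

    The shuffle is seeded so that the same split can be reproduced.
    """
    rng = random.Random(seed)
    training, held_out = {}, {}
    for category, docs in docs_by_category.items():
        shuffled = list(docs)  # copy so the caller's list is not reordered
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        training[category] = shuffled[:cut]
        held_out[category] = shuffled[cut:]
    return training, held_out
```

With 1000 posts in a group, this yields 750 training posts and 250 held-out
posts for that group.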