This program counts the characters, words and sentences in one or more HTML files, lists the average number of words per sentence for each file and builds a dictionary of all the words in the files. I use it to produce these statistics for this website.
I have written a book called The Lost Inheritance. It comprises over 250 inter-linked HTML files. Each file contains a chapter, a footnote article, an index, a topics list or some other demarcatable item of documentation. I wanted to know how many words I had written, and my average number of words per sentence, both in each file and in the book as a whole. I also wanted to know the total vocabulary I had used.
I prefer the flexibility of writing directly in HTML. I initially counted words in a single file by loading it into an HTML-enabled word processor and using its word counter. The one I used also gave a words per sentence figure. However, I only had a free trial beta version of this word processor, which expired, and I could not afford to buy the release version. Besides, this word processor would only deal with one file at a time. I wanted to be able to click on a program that would analyse my entire book as a single batch job and present me with a report.
I therefore wrote my own rendering of the age-old wc program which has graced every Unix installation since the beginning of time. Wishing to be considered up to date and 'leading edge', I decided to write it in Java. There are many different programs sharing the name wc.java. They appear as worked examples in just about every textbook on Java. My version is somewhat more comprehensive than these.
This program counts the characters, words and sentences in all the HTML files to be found in a specified directory (folder) and all its sub-directories (sub-folders). For this purpose, wc.java contains a recursive method called scan(), which calls itself for each sub-directory it finds. Whenever scan() finds an HTML file, it outputs the counts of that file's characters, words and sentences to the console and adds them into its count accumulators. It also writes these counts to a file called wc.txt.
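The recursive directory walk described above can be sketched roughly as follows. This is a minimal illustration and not the actual code of wc.java: the class name ScanSketch, the helper isHtml() and the choice of file extensions (.html and .htm) are assumptions made for the example, and the sketch only collects the files it finds rather than counting their contents.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class ScanSketch {
    // Assumption: a file is treated as HTML if its name ends in .html or .htm.
    static boolean isHtml(String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        return lower.endsWith(".html") || lower.endsWith(".htm");
    }

    // Walks dir and all its sub-directories, adding each HTML file to found.
    static void scan(File dir, List<File> found) {
        File[] entries = dir.listFiles();
        if (entries == null) {
            return; // not a directory, or it could not be read
        }
        Arrays.sort(entries); // gives a repeatable order for the report
        for (File entry : entries) {
            if (entry.isDirectory()) {
                scan(entry, found); // recurse into the sub-directory
            } else if (isHtml(entry.getName())) {
                found.add(entry);
            }
        }
    }

    public static void main(String[] args) {
        File root = new File(args.length > 0 ? args[0] : ".");
        List<File> found = new ArrayList<>();
        scan(root, found);
        for (File f : found) {
            System.out.println(f.getPath());
        }
    }
}
```

In the real program the point at which a file is added to the list would instead be the point at which the file is counted and its totals added to the accumulators.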
Once all the HTML files have been scanned, the main() method in wc.java 'prints' to the console the total counts of characters, words and sentences for all the HTML files put together. It also writes these to wc.txt.
To do the actual counting, wc.java invokes an instance of a separate class, which is in wordCnt.java. The methods in wordCnt.java assume that the HTML files they work with obey the strict XHTML rule of containing no naked '&', '<' or '>' characters. With the strong push by the IT industry towards XML standards, I felt this was a safe and reasonable imposition. And it does make the programming so much simpler, faster and more compact.
Naturally, I made this program to count words only within the BODY of an HTML document. It does not count in the HEAD section. Furthermore, I only wanted my actual narrative or prose counted. I therefore made the program refrain from counting the contents of TABLES, LISTS and FORMS. This is because tabular information and lists are generally not sentences. Including their content would therefore falsely inflate the average number of words per sentence. I made the program exclude FORMS even though there are no forms in my book. Counting is also suppressed between APPLET tag pairs. HTML tags themselves are of course not counted. These are captured and filtered out by an instance of a class called HTMLtag.java.
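The counting rules above can be illustrated with a small state machine. The sketch below is not the code of wordCnt.java or HTMLtag.java. The class name ProseCounterSketch, the exact list of excluded tags (table, ul, ol, dl, form, applet) and the definitions of a word (a run of non-whitespace text) and a sentence (a word followed by '.', '!' or '?') are all assumptions for illustration. It relies on the same XHTML rule of no naked '<' or '>' in the text, treats a tag as ending the current word, and leaves character entities undecoded.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class ProseCounterSketch {
    // Assumed list of elements whose content is not counted.
    static final Set<String> EXCLUDED = new HashSet<>(Arrays.asList(
            "table", "ul", "ol", "dl", "form", "applet"));

    int characters = 0; // non-whitespace characters in counted text
    int words = 0;
    int sentences = 0;

    void count(String html) {
        boolean inBody = false;
        int excludedDepth = 0;          // how many excluded elements are open
        boolean inWord = false;
        boolean afterTerminator = false;
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            if (c == '<') {
                int close = html.indexOf('>', i);
                if (close < 0) {
                    break; // malformed input: stop rather than guess
                }
                String tag = html.substring(i + 1, close).trim().toLowerCase(Locale.ROOT);
                boolean closing = tag.startsWith("/");
                // The tag name is whatever precedes the first space or slash.
                String name = (closing ? tag.substring(1) : tag).split("[\\s/]", 2)[0];
                if (name.equals("body")) {
                    inBody = !closing;
                } else if (EXCLUDED.contains(name)) {
                    excludedDepth = Math.max(0, excludedDepth + (closing ? -1 : 1));
                }
                inWord = false; // simplification: any tag ends the current word
                i = close + 1;
                continue;
            }
            if (Character.isWhitespace(c)) {
                inWord = false;
            } else if (inBody && excludedDepth == 0) {
                characters++;
                boolean terminator = (c == '.' || c == '!' || c == '?');
                if (terminator) {
                    // Count "..." or "?!" once, and only directly after a word.
                    if (inWord && !afterTerminator) {
                        sentences++;
                    }
                } else if (!inWord) {
                    words++;
                    inWord = true;
                }
                afterTerminator = terminator;
            }
            i++;
        }
    }

    double averageWordsPerSentence() {
        return sentences == 0 ? 0.0 : words / (double) sentences;
    }
}
```

A sketch this simple miscounts abbreviations such as "e.g." and decimal numbers as sentence endings, and splits a word that contains an inline tag. A production counter would need extra rules for those cases.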
The program builds a dictionary of all the words found in all the HTML documents contained in the specified directory and all its sub-directories. This it does by invoking an instance of another class dic.java. The methods in dic.java need each word to be presented as it should appear in the dictionary. This means that all HTML character entities within a word must be substituted before the word is passed to dic.java's submit() method. Character entity substitution is done conveniently by calling the capture() method in a static class charEnt.java.
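To show what character entity substitution involves, here is a minimal sketch. It is not the capture() method of charEnt.java. The class name EntitySketch, the method names decode() and lookup(), and the deliberately tiny table of named entities are assumptions for illustration. Numeric entities in decimal (&#233;) and hexadecimal (&#xE9;) form are handled generally, and anything unrecognised is left as it stands.

```java
import java.util.HashMap;
import java.util.Map;

class EntitySketch {
    // A deliberately small table; a real one would list every named entity.
    static final Map<String, Character> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", '&');
        NAMED.put("lt", '<');
        NAMED.put("gt", '>');
        NAMED.put("quot", '"');
        NAMED.put("eacute", '\u00E9');
        NAMED.put("nbsp", '\u00A0');
    }

    // Returns word with every recognised entity replaced by its character.
    static String decode(String word) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < word.length()) {
            char c = word.charAt(i);
            int semi = (c == '&') ? word.indexOf(';', i) : -1;
            String replacement = (semi > i + 1)
                    ? lookup(word.substring(i + 1, semi))
                    : null;
            if (replacement != null) {
                out.append(replacement);
                i = semi + 1;
            } else {
                out.append(c); // not an entity we know: copy it unchanged
                i++;
            }
        }
        return out.toString();
    }

    // Resolves the text between '&' and ';', or returns null if unknown.
    static String lookup(String body) {
        try {
            if (body.startsWith("#x") || body.startsWith("#X")) {
                return new String(Character.toChars(Integer.parseInt(body.substring(2), 16)));
            }
            if (body.startsWith("#")) {
                return new String(Character.toChars(Integer.parseInt(body.substring(1))));
            }
        } catch (IllegalArgumentException e) {
            return null; // not a number, or not a valid code point
        }
        Character named = NAMED.get(body);
        return named == null ? null : String.valueOf(named);
    }
}
```

With this sketch, a word arriving as caf&eacute; would be passed on to the dictionary as the four-letter word ending in e-acute, so that it sorts and counts as the reader would expect.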
dic.java adds to the dictionary each new word it encounters, and increments the occurrence count for each word it encounters which is already in the dictionary. It builds the dictionary in alphabetical order. When all files have been scanned and the dictionary is complete, a second dictionary is created by sorting the first one into descending order of word occurrence. The two dictionaries are then saved side by side in the file dic.txt.
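The two orderings of the dictionary can be sketched as below. This is an illustration only and differs from dic.java in one important respect: it uses the library's TreeMap and List.sort in place of the adapted quicksort. Apart from submit(), which is named in the text, the class and method names and the column layout are assumptions. Words are stored exactly as submitted, with no case folding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class DictionarySketch {
    // TreeMap keeps its keys in String order, which is alphabetical
    // for words of the same letter case.
    private final TreeMap<String, Integer> counts = new TreeMap<>();

    // Adds a new word with a count of 1, or increments an existing count.
    void submit(String word) {
        counts.merge(word, 1, Integer::sum);
    }

    List<Map.Entry<String, Integer>> alphabetical() {
        return new ArrayList<>(counts.entrySet());
    }

    // Same entries in descending order of occurrence. List.sort is stable,
    // so words with equal counts stay in alphabetical order.
    List<Map.Entry<String, Integer>> byOccurrence() {
        List<Map.Entry<String, Integer>> list = new ArrayList<>(counts.entrySet());
        list.sort((a, b) -> b.getValue().compareTo(a.getValue()));
        return list;
    }

    // Lays the two orderings out side by side, one row per word.
    String sideBySide() {
        List<Map.Entry<String, Integer>> alpha = alphabetical();
        List<Map.Entry<String, Integer>> freq = byOccurrence();
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < alpha.size(); i++) {
            out.append(String.format("%-20s %6d    %-20s %6d%n",
                    alpha.get(i).getKey(), alpha.get(i).getValue(),
                    freq.get(i).getKey(), freq.get(i).getValue()));
        }
        return out.toString();
    }
}
```

Submitting the words the, cat, the, sat, the, cat would give the alphabetical column cat, sat, the and the occurrence column the, cat, sat.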
A small test program called charentTest.java is included. It allows you to enter any character entity at the command line and check whether or not the main program interprets it correctly.
All designs, techniques and coding in this my version of wc.java were originated entirely by me, apart from C A R Hoare's 'quicksort' algorithm which I adapted for sorting the dictionary into order of word occurrence in dic.java. This software is published here solely as an example of my work and is not to be used, copied or adapted without my permission.