Take some time to think carefully about what you are looking for. Get a mental picture of the notions and concepts involved.
Try to think of a single word which describes what you are looking for as completely and as exclusively as possible. You may enter further keywords to limit the scope of your search. Up to 16 words can be handled but 3 is a sensible maximum.
Enter this word (or words) in the keyword entry field of the search engine applet above. Then click the 'Search' button.
Note: upper and lower case versions of a letter are regarded as the same letter. Separate multiple keywords with spaces, commas or both. The 'Clear' button clears the entry field. Accented letters are regarded as their equivalent without an accent for search purposes.
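The rules in this note can be sketched in Java roughly as follows. This is an illustration only, not the applet's actual code: the class name `KeywordNormalizer` and its `normalize` method are hypothetical, and accent removal via `java.text.Normalizer` is an assumed technique that the applet may well do differently.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class KeywordNormalizer {
    /** Hypothetical helper: splits on spaces and/or commas, then folds case and accents. */
    static List<String> normalize(String input) {
        List<String> keywords = new ArrayList<>();
        for (String token : input.trim().split("[\\s,]+")) {
            if (token.isEmpty()) {
                continue; // produced by empty input or a leading comma
            }
            // Decompose each accented letter into base letter plus combining mark,
            // then drop the combining marks.
            String decomposed = Normalizer.normalize(token, Normalizer.Form.NFD);
            String unaccented = decomposed.replaceAll("\\p{M}+", "");
            keywords.add(unaccented.toLowerCase(Locale.ROOT));
        }
        return keywords;
    }

    public static void main(String[] args) {
        // \u00e9 is e-acute, \u00c9 is E-acute
        System.out.println(normalize("Caf\u00e9,  R\u00c9SUM\u00c9, index")); // prints [cafe, resume, index]
    }
}
```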
If the first keyword is found, the title, description and URL of the first relevant document appear in the applet's main display area; otherwise a message appears saying that the keyword could not be found. Use the 'Next' and 'Prev' buttons to scan up and down the list of relevant documents.
Once you have found a document you would like to view, click the 'View Document' button. The full document will then be fetched and displayed in a separate tab in your browser's window. This leaves the search applet running so that you can return and select another document in the retrieved shortlist.
When you have finished reading the document, close its browser tab.
NOTE: When the applet first starts, it connects to the server and downloads the site's keyword index. This can take anything from a split second to about half a minute, depending on line speed and Internet traffic levels. If any problems occur during the downloading of the index or a document, an 'exception' message appears on the message line in red. This states the name of the Java method in which the problem occurred and the type of Java 'exception' (what most programmers used to call an error) which occurred. I would appreciate your reporting to me any such occurrence by email. Thank you.
This search engine applet searches only within the domain of this web site. It searches for given keywords within its keyword index.
The index is built off-line by an indexer. The indexer collects all the keywords listed within the keyword meta tag of each HTML file at this web site. It then sorts them into alphabetical order, keeping track of which files each occurred in. It does not extract keywords from the body of the file, i.e. the textual content of the document.
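A minimal sketch of such an indexer, under stated assumptions, might look like this. It is not the real indexer: the class `KeywordIndexer`, its methods, the comma-separated keyword list and the attribute order inside the meta tag are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class KeywordIndexer {
    // Assumption: the tag is written as <meta name="keywords" content="a, b, c">,
    // with the name attribute before the content attribute.
    private static final Pattern KEYWORDS_META = Pattern.compile(
            "<meta\\s+name\\s*=\\s*[\"']keywords[\"']\\s+content\\s*=\\s*[\"']([^\"']*)[\"']",
            Pattern.CASE_INSENSITIVE);

    // A TreeMap keeps the keywords in alphabetical order.
    private final Map<String, List<String>> index = new TreeMap<>();

    /** Records the keywords found in the file's keywords meta tag and nothing else. */
    void addFile(String relativeUrl, String html) {
        Matcher m = KEYWORDS_META.matcher(html);
        if (!m.find()) {
            return; // no keywords meta tag, so nothing to index
        }
        for (String raw : m.group(1).split(",")) {
            String keyword = raw.trim().toLowerCase(Locale.ROOT);
            if (keyword.isEmpty()) {
                continue;
            }
            // Keep track of which files each keyword occurred in.
            List<String> files = index.computeIfAbsent(keyword, k -> new ArrayList<>());
            if (!files.contains(relativeUrl)) {
                files.add(relativeUrl);
            }
        }
    }

    Map<String, List<String>> index() {
        return index;
    }

    public static void main(String[] args) {
        KeywordIndexer indexer = new KeywordIndexer();
        indexer.addFile("a.html",
                "<head><meta name=\"keywords\" content=\"java, applet\"></head><body>search</body>");
        indexer.addFile("b.html",
                "<head><meta name=\"keywords\" content=\"search, java\"></head>");
        System.out.println(indexer.index());
    }
}
```

In this example the word "search" in the body of a.html is not indexed for that file, because only the meta tag is read.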
On finding a keyword in the index, the search engine applet looks up the relative URL of the first HTML file in which that keyword occurred. It then retrieves the title and description from this file. It gets the title from between the file's <title> and </title> tags. It gets the description from the 'content' part of the file's 'description' meta tag. It then displays these in the applet's own window area.
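The title and description lookup could be sketched with two regular expressions, as below. This is only an illustration: `PageSummary` and its methods are hypothetical names, and the sketch assumes that the description tag is written with `name` before `content` and with a double-quoted value. The applet itself may parse the file quite differently.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PageSummary {
    // The title is whatever lies between <title> and </title>.
    private static final Pattern TITLE = Pattern.compile(
            "<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Assumption: <meta name="description" content="..."> with a double-quoted value.
    private static final Pattern DESCRIPTION = Pattern.compile(
            "<meta\\s+name\\s*=\\s*\"description\"\\s+content\\s*=\\s*\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    static String title(String html) {
        return firstGroup(TITLE, html);
    }

    static String description(String html) {
        return firstGroup(DESCRIPTION, html);
    }

    /** Returns the first captured group, trimmed, or an empty string if there is no match. */
    private static String firstGroup(Pattern pattern, String html) {
        Matcher m = pattern.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Search Help</title>"
                + "<meta name=\"description\" content=\"How to use the site's search applet.\">"
                + "</head><body></body></html>";
        System.out.println(title(html));
        System.out.println(description(html));
    }
}
```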
When the user clicks the 'Next' button, the applet retrieves and displays the titles and descriptions of the other files in the list one by one.
If you enter more than one keyword for a given search, the search engine applet proceeds as follows. A full shortlist of HTML documents is retrieved for the first keyword. The first keyword thus always determines the length of the shortlist. It is the primary criterion for the search.
Each HTML document in the shortlist is then ranked according to how many of the subsequent keywords also appear in its keywords meta tag. The more keywords it contains, the higher its rank. The shortlist of relevant web pages is then re-ordered according to rank. The higher a document's rank, the closer it appears to the beginning of the shortlist.
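The ranking step described above can be sketched as follows. The names `ShortlistRanker`, `score` and `rank` are hypothetical, and so is the tie-breaking rule: the text does not say how documents of equal rank are ordered, so this sketch assumes they keep their original shortlist order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class ShortlistRanker {
    /** Counts how many of the subsequent keywords appear among a document's keywords. */
    static int score(Set<String> docKeywords, List<String> extraKeywords) {
        int hits = 0;
        for (String keyword : extraKeywords) {
            if (docKeywords.contains(keyword)) {
                hits++;
            }
        }
        return hits;
    }

    /** Re-orders the shortlist so that higher-scoring documents come first. */
    static List<String> rank(List<String> shortlist,
                             Map<String, Set<String>> keywordsByDoc,
                             List<String> extraKeywords) {
        List<String> ranked = new ArrayList<>(shortlist);
        // List.sort is stable, so equally ranked documents keep their original order (assumed rule).
        ranked.sort(Comparator.comparingInt((String doc) -> score(
                keywordsByDoc.getOrDefault(doc, Collections.<String>emptySet()),
                extraKeywords)).reversed());
        return ranked;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> keywordsByDoc = new HashMap<>();
        keywordsByDoc.put("a.html", new HashSet<>(Arrays.asList("java")));
        keywordsByDoc.put("b.html", new HashSet<>(Arrays.asList("java", "applet", "index")));
        keywordsByDoc.put("c.html", new HashSet<>(Arrays.asList("java", "applet")));

        // The shortlist was retrieved for the first keyword, "java".
        List<String> shortlist = Arrays.asList("a.html", "b.html", "c.html");
        System.out.println(rank(shortlist, keywordsByDoc, Arrays.asList("applet", "index")));
        // prints [b.html, c.html, a.html]
    }
}
```

Note that the length of the shortlist never changes here: the subsequent keywords only re-order what the first keyword retrieved.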
Why does the indexer extract keywords only from meta tags and not from the body or text of the document itself?
Because at least half the keywords a user will think of when looking for information on a given subject will not actually appear in the content of the document they are looking for. The content will appear in the form of phraseology, which far more powerfully expresses the notions concerned than would the large keywords thought of by the user.
Conversely, many large and specific words - potential keywords - appear in the body (text) of the document. While some may correctly be key to the subject of the document, many may not. Many of them may have their legitimate part to play in the text, but do not convey what the document is essentially about. An automatic indexer would blindly include them whether relevant or not to the purpose of finding the right document for the user of a search engine.
Using the 'keyword' meta tag gives the human indexer full control over what keywords his document will be indexed under. This results in search engine listings in which the documents are 100% relevant to the subject matter being sought.
Why have a local in-site search engine at all? Why not simply let the user find stuff on this site using the major public search engines?
It is a sad fact that when technology provides a way, bureaucracy takes it away again. Once upon a time, when the Internet was essentially an academic facility, search engines simply indexed what was there. If it existed, users could find it. Not so now. And increasingly not so. The reason is that the Internet has been all but taken over by commercial interests. Consequently, what was a perfectly workable system has been driven completely pear-shaped by petty self-interest.
To attract potential custom to their sites, commerce has employed underhanded tactics like padding their keyword meta tags with false attractors. In other words, a commercial site selling trucks or modems will put keywords like 'sex' and 'erotic' simply to attract people to the site, even though their site contains nothing erotic or sexy.
This has become a problem for the major Internet search engines. Consequently, they have applied various rules in an attempt to combat this abuse of meta tags. Some have started to ignore keyword meta tags, extracting keywords only from content - the body of the document. Others exclude documents whose keyword meta tags contain any keyword which does not appear in the text of the document - a situation which is very likely in properly indexed documents.
Unfortunately, whatever rules or combination of such rules are applied, they generally tend to penalise, most of all, those documents which have been professionally indexed according to best practice. As a result, particularly since the summer of 1999, properly indexed sites have all too frequently found themselves wrongly excluded from the major search engines.
Alongside the major automatic search engines are the major Internet indexes. These are built by human web surfers. These people are given lists of web sites and are employed to examine and categorise the content they find at each site. The surfer or editor then decides whether or not the site should be included in their index.
But what criteria do these professional surfers use to determine whether or not a given site should be included or rejected? Who knows? They could be many and various. However, one must now at least suspect that one of these criteria will be whether or not the site is likely to provide a source of profit to the major Internet index concerned. One thing is obvious. A single human being, working according to prescribed criteria, cannot possibly second-guess what a world full of individual Internet users are and are not interested in, or what they should or should not be allowed to find.
This is, in effect, censorship of the Internet by the back door. It may not be under the control of a single authority. However, the fact that every participant is now essentially driven by the commercial prerogative means that this collective censorship is narrowly focused upon commercial self-interest and away from the free and open exchange of any and all knowledge and information.
Indeed, this site has been dropped from many search engines on which it consistently appeared for the first 18 months of its existence on the Internet. Furthermore, I have found that without trawling through thousands of entries in a search results listing, it is impossible to resolve most of the information content of this site using a major 'public' search engine. This is the reason for my writing this search engine applet and its associated off-line indexer.
Within this site, unlike in the Internet at large, a strict discipline is followed regarding the proper use of keyword meta tags. That is why, within this site, this search engine applet can provide far more effective results than can a major 'public' search engine.
I originally wrote this search engine according to the client-server model. It had a server-side index searcher called index.class. This contained search and retrieval methods which were invoked by the enquiring client-side applet via RMI [Remote Method Invocation]. It searched for keywords and retrieved the relevant URLs from a highly-tuned dataset comprising 6 data files. The applet simply handled the user input and the presentation of the results.
Unfortunately, there was a big problem with this. It required Java executables running server-side. This is all very well for large corporations. However, there is no way a lowly unemployed programmer like me would be allowed to run an executable on his ISP's mighty server. Certainly not on any web site service tariff that can be afforded by anybody existing on this miserly pittance called welfare. I therefore had no choice but to take the whole thing client-side, leaving nothing but passive files sitting on the server. As far as I can see, Java makes no provision for the random access of files across the Internet. In other words, you cannot write things like:
RandomAccessConnection rac = url.openRandomAccessConnection();
Nor should such provision be made. With RMI there is no point. And it is most unlikely that the economic and commercial restrictions imposed upon the likes of me would even occur to the inhabitants of Sun Microsystems.
So take the whole thing client-side is what I did. Index and all. This restricted the size of the index. Nevertheless, it is possible to contain the index of a fair sized web site client-side without any trouble. I'm not going to explain how the applet works at the moment, but here's the source code so you can work it out for yourself.
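For readers who want a rough idea before opening the source, here is a sketch of the general approach of reading a passive index file sequentially from the server. It is not the applet's source code. The class `IndexLoader`, its methods and the one-line-per-keyword file format (keyword, a tab, then a comma-separated list of relative URLs) are all assumptions made for illustration. The real index format is not described here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class IndexLoader {
    /** Parses lines of the assumed form "keyword<TAB>file1,file2" into a sorted map. */
    static Map<String, List<String>> parse(Reader source) throws IOException {
        Map<String, List<String>> index = new TreeMap<>();
        BufferedReader in = new BufferedReader(source);
        String line;
        while ((line = in.readLine()) != null) {
            int tab = line.indexOf('\t');
            if (tab <= 0) {
                continue; // skip blank or malformed lines
            }
            index.put(line.substring(0, tab),
                    Arrays.asList(line.substring(tab + 1).split(",")));
        }
        return index;
    }

    /** Reads the whole index in one sequential pass; no random access is needed. */
    static Map<String, List<String>> download(URL indexUrl) throws IOException {
        try (Reader reader = new InputStreamReader(indexUrl.openStream(), StandardCharsets.UTF_8)) {
            return parse(reader);
        }
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for the server's file so the example runs off-line.
        String sample = "applet\ta.html\njava\ta.html,b.html\n";
        System.out.println(parse(new StringReader(sample)));
    }
}
```

Because the index is held entirely in memory once downloaded, every subsequent search is a local lookup, which is why the size of the index is the limiting factor.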
Indexing is done by me off-line whenever the web site has undergone substantial modification. It is performed by a program called spider.java.
Sadly, since early 2019, Java programs will not run at all via the Web. See below to download and run the program directly on your computer. You need Java installed on your computer.
Note: mainstream warnings notwithstanding, this program will neither blow up your computer nor wreak any other kind of fanciful mischief. It simply writes on your screen. In the good old days it simply ran embedded within the web page where its static image is now displayed. The embedded applet still runs in pre-2017 versions of browsers with Java 1.6 installed and the Web Start version still runs in pre-2019 versions of browsers with Java 1.8 installed. To read the rancid history of this sad retrogression in Web functionality, please click here.
If you are using Microsoft Windows and a Security Warning box pops up saying that the application has been blocked from running because it is "untrusted" please click here. If you get similar messages with Linux, please click here.
You're probably using the Opera Browser. Click in the middle of the applet. Then click in the narrow space between the applet and the links just below it. Then click in the text field. You should be able to enter keywords now. This is an inexplicable anomaly of the Opera Browser. Also, with Opera, the first button you click must be clicked twice before it works.
Browsers are updated from time to time. Sometimes an update can carry a serious bug that was not there before. An example is the Firefox browser circa April 2012. When you press the 'View Document' button, Firefox takes about 3 to 5 minutes to get around to loading the document. The remedy is to use another browser, the least problematic and most stable of which I have found to be SeaMonkey.
All you get is a blank grey rectangle marking the area of the applet's window, but nothing else appears?
Check the Java Console of your browser to see what kind of Java "Exception" has occurred. Perhaps you simply need to update your version of the Java Runtime Environment (JRE) to at least the version with which I last compiled the applet.
The "security" functions in some of the latest browsers can be problematic. On some settings, the browser does not permit applets to run at all. Other settings permit only so-called "certified" applets to run. That is, applets whose authors have registered the applet with some "authority" or other and acquired a digital certificate for the applet. I can't afford the cost of registration. Some browsers give you the option of allowing applets from a particular web site to run in your browser. Try to find how to configure your browser to allow applets from my website to run within your browser.
A text field and various buttons appear on the applet but you also see a very technical-looking red error message near the bottom of the applet area.
This probably means that your browser is allowing the applet to run but is denying it permission to download its own index from my server. Again, it is a problem with security settings. Somewhere in your browser's configuration (or settings) menu there should be an option for allowing an applet to download data either generally, or from specific named web sites.
This overly-tight security is only necessary with commercial web sites because some of them try to download programs to run natively within your computer. My web site is completely non-commercial. Consequently, I have no motive to want to load programs into your computer. Besides, I don't think this is possible with Java. As I am led to believe, this is only possible with something called ActiveX, about which I know nothing.