Bloody Routers

A faulty router can make your computer effectively blind to a particular server or even whole areas of the Internet. So, if one day you can't access your email or find a website that you know is on-line, the cause could be a faulty router. But what can you do about it?

"pop.20m.com" is a commercial email server that hosts mailboxes for probably hundreds of individuals, companies, clubs, groups and just about every other kind of organizational entity. It is vital for a diversity of people all over the world that this server be reachable.

In the late evening of Saturday, 2 June 2007, I downloaded my email as usual from my account on "pop.20m.com". However, on the morning of Sunday, 3 June 2007, when I tried to download my email, my email client, Mozilla Thunderbird, could not contact "pop.20m.com".

Problem Not With POP Server

My first suspicion was that the "pop.20m.com" server was temporarily off-line. However, when I contacted the administrator via another email service, he assured me that it was on-line and working, and indeed had been all the time. I checked with two friends: one in Utah (geographically close to the server) and one in London, UK. Both could reach "pop.20m.com".

Not My Email Software

So then my suspicion turned to my email client. Perhaps the POP settings had become corrupted or somehow altered. But no. They were unchanged and perfectly correct. Next, I tried accessing my mail through another email client, Evolution. Again, the server pop.20m.com could not be contacted.

In frustration, I disconnected my computer (running Unix) from the Internet cable and in its place connected my laptop running Microsoft Windows XP. I tried again to access my email through the email client program Outlook Express. I had used this to download my email constantly for 3 years up until about 6 months ago. The settings must therefore be correct as nothing on the laptop had been altered since I put it in the cupboard 6 months previously. Again, pop.20m.com could not be contacted.
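The checks above all went through one mail client or another. A minimal sketch of the same test without any mail client is shown here: it simply tries to open a TCP connection to the server. The helper name can_connect is hypothetical, and port 110 (the standard POP3 port) is an assumption, since the port actually used is not stated.

```python
import socket


def can_connect(host: str, port: int = 110, timeout: float = 10.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Port 110 is the standard POP3 port; that the server described in the
    article listened on it is an assumption.
    """
    try:
        # create_connection resolves the name and tries each address in turn.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failure, refusal, timeout and unreachable-network errors.
        return False
```

Calling can_connect("pop.20m.com") from two different machines on the same line, as was done here with the Unix machine and the laptop, would show whether the failure follows the machine or the connection. A False result does not say where along the route the packets are lost.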

Ran "Traceroute"

I needed to know why pop.20m.com could not be contacted. I reconnected my Unix machine to the Internet, opened a terminal window and ran "traceroute". In fact, I ran it several times, finally extending it up to 100 possible router hops and waiting up to 60 seconds for a response from each router en route. This was the result. The local name-server resolved the name pop.20m.com correctly to the IP address 64.136.25.170. As you can see from the trace, the final router to return any information was PO9-0.ARC-RJ-ROTN-01.telemar.net.br (200.223.131.97).
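For readers who want to repeat this, here is a rough sketch of such a run driven from Python. It assumes a Unix-style traceroute on the PATH, where -m sets the maximum number of hops and -w the wait in seconds for each probe. The function names are hypothetical, and the parsing assumes the usual "name (address)" layout of each hop line.

```python
import re
import subprocess
from typing import List, Optional

# A hop line starts with the hop number; a responding hop shows "(a.b.c.d)".
HOP_LINE = re.compile(r"^\s*\d+\s.*?\((\d{1,3}(?:\.\d{1,3}){3})\)")


def traceroute_command(host: str, max_hops: int = 100, wait_s: int = 60) -> List[str]:
    """Build the command line: -m is the hop limit, -w the per-probe wait."""
    return ["traceroute", "-m", str(max_hops), "-w", str(wait_s), host]


def run_trace(host: str) -> str:
    """Run traceroute and return whatever it printed on standard output."""
    result = subprocess.run(
        traceroute_command(host), capture_output=True, text=True, check=False
    )
    return result.stdout


def last_responding_hop(output: str) -> Optional[str]:
    """Return the address on the last hop line that names a responder."""
    last = None
    for line in output.splitlines():
        match = HOP_LINE.match(line)
        if match:
            last = match.group(1)
    return last
```

If the trace completes, the last responding hop is the destination itself. If it stops short, as it did here, the function returns the final router that answered. With limits as generous as 100 hops and 60 seconds, a blocked trace can take a very long time to finish.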

My first thought was that this was the guilty router that was not passing on my IP packets but was instead just dumping them because it did not know where to send them. True, it may not know where to send them, but this could be because it is not the job of this router to forward packets to destinations like pop.20m.com. The guilty router could be the one before it, namely PO9-0.HGA-MG-ROTN-01.telemar.net.br (200.223.131.117), which was erroneously forwarding my packets to PO9-0.ARC-RJ-ROTN-01.telemar.net.br (200.223.131.97).

To resolve this question, I tried tracing routes to addresses adjacent to that of pop.20m.com (64.136.25.170), namely the five addresses whose final number is 168, 169, 170, 171 and 172, to see whether they could be reached and which sequence of routers each route followed. Here is the result for a route trace to all 5 IP addresses, including robmorton.20m.com (64.136.25.171), the web server on which you are viewing this site.
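The list of neighbouring addresses to trace can be produced mechanically. This is a small sketch using Python's standard ipaddress module. The function name neighbours and its spread parameter are hypothetical.

```python
import ipaddress
from typing import List


def neighbours(address: str, spread: int = 2) -> List[str]:
    """Return the addresses from `address - spread` to `address + spread`."""
    centre = ipaddress.IPv4Address(address)
    # Adding an integer to an IPv4Address gives the address that many steps away.
    return [str(centre + offset) for offset in range(-spread, spread + 1)]


# The five addresses traced in the article, centred on pop.20m.com.
print(neighbours("64.136.25.170"))
# ['64.136.25.168', '64.136.25.169', '64.136.25.170', '64.136.25.171', '64.136.25.172']
```

Each address in the list can then be traced in turn and the sequences of routers compared.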

Each of the other 4 traces reached its destination in fewer than 30 hops. In every trace, the first router in the chain never returns any information as to its identity. The second router also usually returns no information but sometimes identifies itself as 10.11.0.202. This router may or may not be present in the routes where the second router does not identify itself. However, in all cases, the third router identifies itself as 200.149.249.57. But from here on, 3 of the routes go via 200.195.78.42 and 200.195.78.26, while the other two routes go via 200.195.78.6 and 200.195.78.30. The blocked route goes via the former pair of routers and then onwards via

PO9-0.HGA-MG-ROTN-01.telemar.net.br (200.223.131.117)
PO9-0.ARC-RJ-ROTN-01.telemar.net.br (200.223.131.97)

from where the route goes nowhere. Since neither of the above two Telemar routers features in the other 4 routes, it is possible that the fault lies with router 200.195.78.26, in that it is passing my packets to the wrong down-stream router. Because this router does not identify itself, I do not know whether it is part of WayTV's (my ISP's) network or whether it is part of the Brazilian backbone run by Telemar. On the other hand, the fault could lie with one of the two Telemar routers themselves. Somehow, between Saturday night 2 June and Sunday morning 3 June, a routing table became corrupted. Or perhaps it was at that time deliberately changed for some reason best known to WayTV or Telemar. I don't know. However, it means I can no longer download the email from my account.
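The comparison above can be sketched in a few lines of Python. The function names and route labels are hypothetical. The routes contain only the hops named in this article, because the later hops of the working routes are not listed here and are left out rather than guessed.

```python
from typing import Dict, List, Optional, Set


def hops_unique_to(blocked: str, routes: Dict[str, List[str]]) -> List[str]:
    """List hops on the blocked route that appear on no other route, in order."""
    seen_elsewhere: Set[str] = set()
    for label, hops in routes.items():
        if label != blocked:
            seen_elsewhere.update(hops)
    return [hop for hop in routes[blocked] if hop not in seen_elsewhere]


def last_shared_hop(blocked: str, routes: Dict[str, List[str]]) -> Optional[str]:
    """Return the last hop on the blocked route that a working route also uses."""
    unique = set(hops_unique_to(blocked, routes))
    shared = [hop for hop in routes[blocked] if hop not in unique]
    return shared[-1] if shared else None


# One entry per kind of route described in the text (labels are illustrative).
routes = {
    "blocked": ["200.149.249.57", "200.195.78.42", "200.195.78.26",
                "200.223.131.117", "200.223.131.97"],
    "working_via_42_26": ["200.149.249.57", "200.195.78.42", "200.195.78.26"],
    "working_via_6_30": ["200.149.249.57", "200.195.78.6", "200.195.78.30"],
}

print(hops_unique_to("blocked", routes))   # ['200.223.131.117', '200.223.131.97']
print(last_shared_hop("blocked", routes))  # 200.195.78.26
```

This only narrows the field to the three candidates discussed above: the last router shared with a working route and the two that appear on the blocked route alone. It cannot say which of them holds the bad routing entry.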

I am in Belo Horizonte-MG, Brazil. My POP server is in Orem-UT, USA. The reason I have my mail server in the USA when I live in Brazil is that I already had it long before I came to Brazil, when I lived in the UK. It is very convenient to have it in the USA, where most of my correspondents are located. I have had it for over 6 years and its address is known by many people all over the world. Trying to track down everybody who may need to know a new email address would be a gargantuan task.

To make sure that pop.20m.com was indeed reachable from outside the USA, I asked a friend to do a route trace from Wembley, London, UK. My friend's computer ran Windows XP on which the route tracer is called "tracert" instead of "traceroute". This was the result. Very efficient. Got there in only 10 hops. From this, I thought it safe to conclude that the problem was with the WayTV or Telemar routers in Brazil. So I embarked on the grand and testing mission of trying to inform WayTV and Telemar about the problem so that hopefully they would see their way to correcting it.

Informing My ISP

My connection to the Internet is 70kbps both ways through a co-axial cable that also carries about 70 TV channels. The service is provided by Way TV BH S.A., a local provider in Belo Horizonte. I therefore thought it appropriate to contact their customer help service (known as SAC) to get the problem solved.

The person on the telephone said categorically that the problem was with my email client software. They asked which client software I used. I told them that I had already checked this thoroughly and that I had traced the problem to a router fault somewhere in Brazil. The person insisted that it must be my email software and persisted with asking me what I used. I told the person I used Mozilla Thunderbird. I said I had also tried with Evolution. Finally, I said that I had also tried with Outlook Express on my laptop with exactly the same problem. This telephone conversation lasted over half an hour. I needed a break.

A bit later I rang again. This time I emphasised that my email program was absolutely correctly configured and that the problem was not with that. The person to whom I was then speaking deduced that the problem must be with the email server. He asked whether I was using Way's own email server. When I explained that my email server was in the USA, the person replied that the problem was not with WayTV and that I would have to contact my email service provider in the United States. I told the person I had already done this and had also confirmed that the US email server was reachable not only from within the United States but also from the United Kingdom. I said that the problem was in the Brazilian part of the route between my computer and the server in the United States, and that I had tracked it down to one of 3 possible routers. To this, the person replied that if the problem were within the Telemar network, then that was beyond WayTV's local network and was not therefore WayTV's responsibility, and that I would have to contact Telemar.

I asked to speak to the person's supervisor or to a technician with more specialized knowledge. My request was firmly denied. Later, I used the email form on WayTV's web site to send full details of my route traces together with a full explanation. Two weeks later I had still received no response.

This is all very strange in the light of a similar situation that occurred in February 2005. That event was more serious in that I could not even access my own website. There were a large number of other websites that I also could not access. However, on that occasion, when I sent them a full explanation by email, WayTV fixed the problem within two days, apologizing for the inconvenience. Why the change in attitude? What has changed at WayTV?

Contractual Relationship

I have a contract with WayTV to provide me with Internet access. I have no contract with Telemar, which operates the backbone network to which WayTV subscribes. Presumably, WayTV has a contract with Telemar to provide them with access to the Brazilian backbone, which in turn is connected by mutual agreement with backbone operators in other countries.

This means that WayTV is responsible to me for making the service work properly, while Telemar is responsible to WayTV for making the service to them work properly. Consequently, to my mind, correcting their own routers, or informing Telemar of a possible fault in a Telemar router, is the responsibility of WayTV, since it is WayTV that has the contract for service with Telemar. However, WayTV did not see it that way.

Contacting Telemar

I researched widely on the Web to try to find a way to contact Telemar about their possible router fault. I found Telemar's website. This advertised that Telemar had an excellent telephone support service (SAC). However, it omitted to give the number to ring. After literally hours of searching, all I could find was the normal help line for telephone problems. I rang and explained about how I had traced an Internet connectivity fault to one of possibly two of their routers. I gave the addresses of the two routers that needed investigation:

PO9-0.HGA-MG-ROTN-01.telemar.net.br (200.223.131.117)
PO9-0.ARC-RJ-ROTN-01.telemar.net.br (200.223.131.97)

Telemar's telephone help person then asked indignantly how I had obtained the IP addresses of their computers because even they did not have access to these computers. The tone of the question implied that I had been illegally hacking into two of Telemar's corporate computers and that this was a serious matter. I explained that I had simply run the "traceroute" program to find out where the connectivity to my email server was being blocked and that "traceroute" (or an equivalent program) existed on just about every PC in the world.

This person clearly knew nothing about the Internet or the Telemar backbone. This is not unreasonable. What is unreasonable is that the person did not know how to put me in contact with somebody who did know about these things. That was the big failing. It was obvious that I was going to get nowhere trying to talk to Telemar directly. I had to seek another way.

Implications

Router problems can, for various end-users at various times, render small parts or vast areas of the Internet inaccessible. In a global society that relies more and more on the Internet for its communication needs, this can be critically disruptive to commercial, academic and private endeavours. In countries where such faults typically persist at most for a few hours, the problem is not too serious. However, where such faults persist, or become permanent, the implications are very serious. The fault that is preventing me getting at my email has so far persisted for almost a month.

It is perhaps an impossible task for the technical administrator of a router to keep track of whether or not his router is forwarding packets to their correct destinations all the time. He is only human. Errors are excusable. On the other hand, the vast multitude of Internet end-users is very sensitive to router faults. I knew within half an hour of my attempt to access my email on the morning of Sunday 3 June 2007 approximately where the fault was located. The inexcusable problem is that there exists no way of reporting such a fault to a person capable of understanding it, let alone fixing it. So presumably, the same will be the case no matter how many other router faults may occur in the future. For all practical purposes, they will be cumulative and permanent.

Seeking Help

I searched far and wide on the Web to try to find out how to report a router fault. I could find lots of commercial propaganda about router products and how to set up one's own router. But absolutely nothing on how a user can report a suspected router fault to a router administrator. I tried the obvious. I sent my route traces with a full explanation to admin@200.223.131.117 (the IP address of the Telemar router) but naturally no such mailbox existed.

Next, I tried searching the Web for the names of the two suspect Telemar routers. I was lucky. I found references to these routers in two blogs. I emailed what I judged to be the most relevant contributor from each blog. They both very kindly replied. One was very critical of the telecom support services and said that my best option was to write to Anatel (Brazil's telecoms consumer watchdog). The other was the network administrator for a university faculty. He said that it is essentially the obligation of WayTV to solve the problem, but that it could be very complicated if it involved routers of suppliers of their suppliers etc. Both of these people ran a route trace to pop.20m.com and found it blocked at the same routers. Both used Internet Service Providers that were different from mine and from each other. Access to the Internet was therefore for each of us through entirely different routes. This rather points the finger at the Telemar backbone routers.

I have communicated with WayTV 3 times by telephone, twice by direct email and once through their website reply form. So far (29 June 2007) I have had no response that is at all helpful. My only option at this point, it seemed, was to seek the help of Anatel.

But Events Intervened

Each morning since 3 June 2007, I had run traceroute to see if I could reach "pop.20m.com". Almost a whole month went by without being able to get pop access to my email. Then, on the afternoon of 2 July 2007, suddenly the route to "pop.20m.com" opened up. I got there in just 14 hops. I tried again around 6pm and "pop.20m.com" was still reachable. I downloaded the 93 emails that had accumulated over the month into my Mozilla Thunderbird email client. I was over the moon.

Notwithstanding, this was a router fault that lasted from 3 June to 2 July 2007 inclusive. That's 30 days. Interruptions of this length of time in this day and age are really not acceptable. Router faults generally last no more than a couple of hours or so. What is even more unacceptable is that, despite my incisive inquiries, I received no acknowledgement or information from my ISP (WayTV) or from Telemar (the backbone operator). Absolutely no communication. None of their customer help (SAC) staff knew anything.

Judging from the latest route traces, it would appear that during the month of June 2007, the Brazilian backbone underwent a grand metamorphosis. None of the original backbone routers seems to feature in the route any more. Accessing a server in the USA, I am out of Brazil in 4 hops. Brilliant. Perhaps this is the reason for all the disruption. But they could at least have told me, as an enquiring affected user. I had no idea whether or when the blockage would end until it actually did.

The question is, will it happen again? How likely is it to happen again? How many servers in the outside world did this fault really render unreachable? And, bearing in mind that the majority of users would not know anything about route tracing, how many people were affected unknowingly and suffered disruption to their work as a result?

In my opinion, users should have the right to a direct and efficient channel for reporting router faults so that they get corrected promptly.


© 3 July 2007 Robert John Morton