caused by Telia (TeliaSonera)
Published: 01-10-2013 | Author: Michel Greijmans
We monitor our clients from a server located in the US (NYC, to be specific). Our monitoring server checks our clients' servers every minute for uptime and many system services. For months now I have noticed that clients connected via a UTS connection continuously show "downtime" (sometimes for hours, if not days) without actually being down: I was still able to log in remotely via RDP, (web)mail, and SSH.
After doing a bit of research, mainly tracerouting, I noticed a recurring similarity across clients. Packets would disappear/drop somewhere in the US (in my case). To be very specific: packets would drop on Telia (TeliaSonera) routes.
To prove this, I supply the traceroute/MTR results below. The graphs below show 2 clients monitored from our (same) monitoring server in NYC. Both servers are connected via a UTS (Business) ADSL/VDSL Internet connection. Server 1 (Blue) is on the east side of the island, Server 2 (Red) is on the west side of the island.
Graph: 48 Hours
In the above graphs you see the ping response results over the past 48 hours (as of the time this article is written). You'll immediately notice the similarity between the two graphs.
To repeat: these are 2 different servers, on 2 different UTS connections, in 2 different areas (let's just call them "West" and "East" for now). This same pattern will repeat itself in the graphs below.
Graph: Past 7 Days
Graph: Past 31 Days
Apparently, further back in time, the routing/network to Server 2 (Red/West side) used to be better than Server 1's connection; however, a serious networking problem remains visible.
The FLOW Comparison
As a comparison I've attached the graph of "Server 3", which is located in the same neighborhood as "Server 1", but is connected via a FLOW Cable connection.
The above graphs prove that the UTS network was facing serious network issues; however, they don't show the specific points of failure. To pinpoint those I used traceroutes, but more frequently MTR, which combines traceroute and ping into one continuously updating view. The results below will clearly prove that packets get dropped at Telia (TeliaSonera) hubs.
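What MTR reports per hop boils down to a loss percentage and an average round-trip time over repeated probes. A minimal sketch of that bookkeeping, assuming the probe results are already collected; the hop names and numbers are invented for illustration and are not taken from the screenshots:

```python
from statistics import mean

def summarize_hop(samples):
    """Return (loss_percent, avg_rtt_ms) for one hop's probe results.

    `samples` holds one entry per probe: a round-trip time in ms,
    or None when no reply came back.
    """
    if not samples:
        raise ValueError("need at least one probe result")
    replies = [s for s in samples if s is not None]
    loss = 100.0 * (len(samples) - len(replies)) / len(samples)
    avg = mean(replies) if replies else None
    return loss, avg

# Hypothetical probe results, invented for illustration only.
probes = {
    "hop-a": [10.0, 12.0, 11.0, 11.0],
    "hop-b": [80.0, None, None, 82.0],
    "hop-c": [None, None, None, None],
}

for name, samples in probes.items():
    loss, avg = summarize_hop(samples)
    print(name, f"loss={loss:.0f}%", f"avg={avg}")
```

A hop where the loss jumps and stays high for every hop after it is the interesting one; loss at a single hop in the middle that later hops do not share usually just means that router answers probes reluctantly.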
Traceroutes From UTS
This first screenshot shows an MTR session from a Windows Server on a UTS connection ("Server 1", to be specific). This test was done on Sunday.
Below is another MTR session, taken while I'm writing this article.
Traceroute To UTS
Now this is where it gets interesting. Below you find an MTR session done from the monitoring server (Linux/CLI) to "Server 1". This test was done during the same period of time as the above screenshot (during the time of writing this article).
To be very honest, I wasn't planning on writing this article at all, since I assumed the problem was limited to me. Until today I thought I was the only one facing these issues, specifically towards our monitoring server.
However, this afternoon I heard from 2 friends of mine that they were having the same issues. One of them was able to provide me with definite proof that he's facing this problem towards entirely different countries/datacenters, making it almost impossible for him to reach his server in the Netherlands from UTS.
Traceroute from Ziggo (NL) to UTS
The first traceroute he supplied me was a traceroute executed from his Ziggo connection at home:
Tracing route to sub-163ipXXX.rev.onenet.an [216.152.163.XXX] over a maximum of 30 hops:
  1     1 ms     1 ms     2 ms  192.168.x.x
  2     8 ms    20 ms    10 ms  xxxxx.dynamic.ziggo.nl [83.86.xx.xx]
  3     7 ms     8 ms     6 ms  gv-rc0011-cr101-irb-201.core.as9143.net [18.104.22.168]
  4     9 ms    10 ms     7 ms  asd-lc0006-cr101-ae8-0.core.as9143.net [22.214.171.124]
  5     *        9 ms     7 ms  adm-b7-link.telia.net [126.96.36.199]
  6    14 ms    11 ms    10 ms  adm-bb3-link.telia.net [188.8.131.52]
  7    18 ms    18 ms    21 ms  ldn-bb1-link.telia.net [184.108.40.206]
  8    98 ms    98 ms    97 ms  ash-bb3-link.telia.net [220.127.116.11]
  9     *        *        *     Request timed out.
 10     *        *        *     Request timed out.
 11     *        *        *     Request timed out.
 12     *        *        *     Request timed out.
 13     *        *        *     Request timed out.
 14     *        *        *     Request timed out.
 15     *        *        *     Request timed out.
 16     *        *        *     Request timed out.
 17     *        *     ^C
Traceroute from his Server (NL)
Next he supplied me with a traceroute from his Server in the Dutch datacenter Netrouting to a UTS IP:
Tracing route to sub-163ipXXX.rev.onenet.an [216.152.163.xxx] over a maximum of 30 hops:
  1     1 ms     1 ms     1 ms  xxxx [xxx]
  2     1 ms     1 ms     1 ms  ar1.spk.nl.netrouting.net [18.104.22.168]
  3    24 ms     2 ms    12 ms  r1.ams1.nl.netrouting.net [22.214.171.124]
  4     1 ms     1 ms     1 ms  adm-b7-link.telia.net [126.96.36.199]
  5     2 ms     1 ms     1 ms  adm-bb4-link.telia.net [188.8.131.52]
  6     6 ms     6 ms     6 ms  ldn-bb2-link.telia.net [184.108.40.206]
  7    80 ms    80 ms    80 ms  ash-bb4-link.telia.net [220.127.116.11]
  8     *        *        *     Request timed out.
  9     *        *        *     Request timed out.
 10     *        *        *     Request timed out.
 11     *        *        *     Request timed out.
 12     *        *        *     Request timed out.
 13     *        *        *     Request timed out.
 14     *        *        *     Request timed out.
 15     *        *        *     Request timed out.
 16     *        *        *     Request timed out.
 17     *        *        *     Request timed out.
 18     *        *        *     Request timed out.
 19     *        *        *     Request timed out.
Traceroute from UTS to his Server (NL)
He also supplied me with a corresponding traceroute from his UTS connection to his Server:
Tracing route to xxxx [xxxx] over a maximum of 30 hops:
  1     1 ms     1 ms     1 ms  xxx
  2     4 ms     4 ms     4 ms  sub-190-88-192ip1.rev.onenet.an [18.104.22.168]
  3     4 ms     4 ms     4 ms  sub-190-4-175ip12.rev.onenet.an [22.214.171.124]
  4    51 ms    59 ms    51 ms  172.20.1.37
  5    51 ms    51 ms    51 ms  mai-b1-link.telia.net [126.96.36.199]
  6     *        *        *     Request timed out.
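Reading a trace like the ones above comes down to finding the last hop that still answers. A minimal sketch of doing that in code for Windows tracert-style text; the function name and the regular expression are illustrative, and the sample is trimmed from the trace run from the Netrouting server, with the bracketed IP addresses left out:

```python
import re

# A hop line: hop number, three probe results ("80 ms", "<1 ms" or "*"),
# then the host name (or the word "Request" on a timed-out line).
HOP_RE = re.compile(r"^\s*(\d+)\s+((?:(?:<?\d+\s+ms|\*)\s+){3})(\S+)")

def last_responding_hop(tracert_output):
    """Return (hop_number, host) of the last hop that replied, or None."""
    last = None
    for line in tracert_output.splitlines():
        match = HOP_RE.match(line)
        if not match:
            continue  # header, blank line, or interrupted hop
        hop, times, host = match.groups()
        if "ms" in times:  # at least one of the three probes was answered
            last = (int(hop), host)
    return last

SAMPLE = """\
  6     6 ms     6 ms     6 ms  ldn-bb2-link.telia.net
  7    80 ms    80 ms    80 ms  ash-bb4-link.telia.net
  8     *        *        *     Request timed out.
  9     *        *        *     Request timed out.
"""

print(last_responding_hop(SAMPLE))
```

The last responding hop only narrows down where to look: the drop happens at or beyond that router, on the forward or the return path.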
To prove his point even further he linked me to a handy tool that uses over 50 servers, spread all across the world/internet, to test ping responses. I ran this test, and the following screenshots show the extent of the problem:
BGP: "Route First"
In this section I'm going to explain a bit about how this is happening, and why I think it's UTS's responsibility.
Internet Routing: The Basics
As you may have heard before, the Internet (or, for that matter, any computer network) is built around IP addresses. Each (physical or virtual) machine on the Internet has a unique IP address (there are exceptions, but I'm not going into that). In order for your machine to communicate with another machine across the Internet, your packets pass through routers. Routers can be compared to traffic lights that send a packet in the right direction. In almost all cases your packets pass through multiple routers before they reach their destination (and back).
But in order for those routers to know what range of IP addresses to route through what "provider" or "network" they use a protocol called BGP (Border Gateway Protocol).
To keep it simple and short: each ISP (Internet Service Provider) has to configure its BGP, as this informs the Internet's routers which routes/networks to send connections through to reach the ISP (and back).
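Once a router has learned which IP ranges belong to which network, forwarding a packet is a lookup: pick the most specific range that contains the destination. A minimal sketch of that lookup, assuming a made-up routing table; this is not BGP itself, only what a router does with the ranges it has learned, and the prefixes (from the address blocks reserved for documentation) and network names are hypothetical:

```python
import ipaddress

# Hypothetical routing table: announced prefix -> network to hand the packet to.
ROUTES = {
    "0.0.0.0/0": "default-upstream",
    "198.51.100.0/24": "carrier-a",
    "198.51.100.128/25": "carrier-b",
}

def next_hop(destination, routes=ROUTES):
    """Return the network for the longest prefix containing `destination`."""
    addr = ipaddress.ip_address(destination)
    best = None
    for prefix, hop in routes.items():
        net = ipaddress.ip_network(prefix)
        # Keep the matching prefix with the most bits, i.e. the most specific.
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, hop)
    return best[1] if best else None

for ip in ("198.51.100.200", "198.51.100.5", "203.0.113.9"):
    print(ip, "->", next_hop(ip))
```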
The "Cheap" Factor
I've tried explaining the BGP/Internet Routing basics in the above paragraph. These are very sophisticated terms and standards, so don't worry if you didn't fully understand them.
The point I'm trying to get you to understand is that ultimately UTS tells the Internet's routers what routes to use.
Another factor you need to understand is that UTS (just like us) needs to "buy" internet access from a provider. Providers in this case are very big international companies that run and maintain huge (dark) fibre networks across the globe. After all, the Internet is just a bunch of lines/pipes. Examples of such companies are Columbus (known for their dark fibre cables here in the Caribbean), Level(3) Networks, Hurricane Electric (HE), Global Crossing, and many more.
Obviously some of these providers are more expensive than others, so an ISP (Internet Service Provider, in this case UTS) has to choose which fibre providers to connect to. During this process they use BGP to inform the Internet's routers worldwide which route connections should take to and from the UTS network.
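How an ISP's announcements steer traffic can be sketched roughly as follows. This is a simplification and not UTS's actual configuration: the carrier names are made up, the AS numbers come from the range reserved for documentation, and real BGP compares more attributes than the path length used here. When the same address range is announced through two carriers, a remote router tends to prefer the announcement with the shorter AS path, so an ISP can make one carrier less attractive by repeating its own AS number in the path (prepending).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Announcement:
    via: str        # hypothetical carrier the announcement arrived through
    as_path: tuple  # AS numbers the announcement passed, nearest first

def preferred(announcements):
    """Return the announcement with the shortest AS path (ties: first listed).

    Real BGP checks other attributes first (local preference, for one);
    this sketch looks at path length only.
    """
    return min(announcements, key=lambda a: len(a.as_path))

# 64500 stands in for the ISP; 64496 and 64497 for its two carriers.
offers = [
    Announcement("carrier-x", (64496, 64500)),
    Announcement("carrier-y", (64497, 64500, 64500, 64500)),  # prepended
]

print(preferred(offers).via)
```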
In the case discussed in this article, the problem is caused by Telia's (TeliaSonera) network.
Telia (TeliaSonera) is known to be a (relatively) cheap provider. From my observation, UTS is clearly trying to route as much of their internet traffic as possible through this network to save on connection/peering costs.
Where lies the Responsibility?
The big question, however, remains: who is ultimately responsible for the bad connectivity? First of all: I'm not a lawyer. If you want to talk about the contracts you (as a client) have with UTS, or for that matter the ones UTS has with a provider like Telia, you'd need to study the contract(s).
However, in my opinion UTS is ultimately responsible for the fact that they are (as proven above) practically delivering an unstable and unreliable internet connection. After all, it is they who decide to route their traffic primarily via TeliaSonera. If they are aware of the above-mentioned problems (and honestly I hope they are; if not... wow), they should urge TeliaSonera to improve their network, or else re-route their traffic via another provider.
By writing this article I'm hoping to demonstrate the problems in UTS's OneNet Internet Services, and hopefully motivate them to improve their products and services. This article is not intended to disparage UTS or its employees in any way. FLOW is also going to get their fair share of criticism (read here).
Keep in mind that I'm (currently a sleepy) human being. By that I mean there are probably quite a few spelling and grammar mistakes in this article. In case you find technical mistakes, or you disagree with something in this article, please let me know.
Feel free to use this article for re-publication, however please be so kind to link back to this page.
Lastly, I would like to thank my sources (incl. Alexander Hughes) for their contribution to this article. Without them I couldn't have done this.