A Survey of Internet Incidents & Policy Advice
Author: Jesse M. Caulfield
Publication Date: June 18, 2007
A brief study of how the Internet industry quickly evolved to overcome engineering errors, natural disasters and sabotage to provide a vital, growing and transformative global communication service.
PDF: Network Neutrality White Paper – Internet Incidents & Policy Advice
Abstract:
The modern Internet grew from modest roots. Begun as a US government-funded research network in the late 1960s, the early Arpanet provided data connectivity between about 100 universities and government facilities to support academic research and collaboration.
In the decades since, a few key transitions and many unfortunate incidents contributed to a radical evolution of this new communications system. One of the most important transitions occurred in 1983 when Arpanet changed from a single, centrally controlled network into a “network of networks” that would eventually grow to become the Internet.
Early Internet engineers, customers and service providers had great expectations but could not have foreseen the extent and speed with which the Internet, and private networks using Internet technologies, would displace other voice, video and data networks. By the mid-1990s technology visionaries began to think of the future Internet as a global system capable of supporting all types of digital communication, including traditional web services plus voice and broadcast video.
As the Internet has grown to become the dominant global communications platform it has also encountered many, sometimes serious, obstacles to its success. Yet time and again, a fiercely competitive yet highly collaborative and dynamic Internet industry has evolved to develop new technologies, business strategies and operational procedures to maintain its astonishing pace of upgrades and improvements while ensuring near uninterrupted service for the widest possible audience.
This document will examine three typical categories of incidents encountered by the Internet industry and discuss how people, companies and communities of interest came together to overcome each challenge, correct mistakes or reinvent their businesses, and ultimately improve the Internet as a whole. We will conclude with a summary of lessons learned and implications for how public policy may continue to foster and support a vibrant Internet industry as it prepares for an expected massive increase in bandwidth demand in the coming years.
Category 1: Configuration Errors
The Chicago Network Outage, October 21, 2005
The Internet is composed of many independent network providers, each operating its own private facilities. Each network is called an autonomous system (AS) and exchanges routing information with other networks using the Border Gateway Protocol (BGP). Engineers configure routers with BGP to filter, manage and control the routing instructions that are received and announced between autonomous systems. Mistakes in configuring BGP can create routing errors and cause parts of the Internet to become unreachable for customers using the misconfigured network.
On Friday, October 21st at about 2 AM in Chicago, a BGP router was misconfigured during a routine network upgrade, causing a major American Internet service provider to disconnect from the Internet and isolate itself and all of its customers. While most affected customers were asleep at the time and never noticed the incident, several hundred thousand homes were affected by the outage until service was restored almost four hours later, just in time for breakfast.
There are few reasons why a major network would isolate itself from the Internet. In this case it was to limit the spread and effect of an unfortunate engineering mistake from other networks and the Internet as a whole. This worked because the Internet is a collection of many private networks and when one autonomous system encounters a service outage others may also be affected depending upon their connectedness. Similarly, other independent networks can continue to operate and intercommunicate with each other should one disconnect.
What happened on October 21st is an example of route leaking: the introduction of internal packet routing instructions into the network service provider’s external, or global, routing table. Route leaking is the functional equivalent of handing a New York City cab driver a street map of San Francisco with directions to find your favorite restaurant at maximum speed. It is a sure recipe for network misdirection, congestion and outage.
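A route leak of this kind is commonly guarded against with an export filter that refuses to announce internal prefixes to other networks. The following is a minimal sketch of that idea, not a description of the affected provider's configuration. The internal address blocks, the example routes and the function names are all assumptions chosen for illustration.

```python
import ipaddress

# Assumption: these private blocks stand in for a provider's internal routes.
INTERNAL_BLOCKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_internal(prefix):
    """Return True if the prefix lies inside any internal block."""
    network = ipaddress.ip_network(prefix)
    return any(network.subnet_of(block) for block in INTERNAL_BLOCKS)


def export_filter(candidate_routes):
    """Split candidate routes into those safe to announce and those blocked."""
    announce = [p for p in candidate_routes if not is_internal(p)]
    blocked = [p for p in candidate_routes if is_internal(p)]
    return announce, blocked


# Hypothetical routes a border router is about to announce to its peers.
routes = ["203.0.113.0/24", "10.20.0.0/16", "198.51.100.0/24", "192.168.5.0/24"]
announce, blocked = export_filter(routes)
print("announce:", announce)
print("blocked: ", blocked)
```

In this sketch the two internal prefixes never reach the external routing table, which is the property a misconfigured filter fails to preserve.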
There are many historical examples in the Internet’s early years of route misconfigurations and a few that caused broad-based outages. This catalog of incidents is matched by a body of active research and development to improve network routing architectures, invent automated network survivability and repair tools, re-define industry maintenance practices and implement new, cooperative procedures between network operators and engineers across the industry. All of this research and development was a factor in restricting the effect of the Chicago outage to one network and limiting its duration to only four hours.
In the United States today 20% of all network downtime is due to planned network maintenance. (5) Because of engineering preparations and architectural strategies, 70% of all unplanned network outages affect only a single customer at a time. Network service providers continue to develop ever more reliable operational planning and emergency outage procedures to prevent mistakes and to coordinate their correction should they occur.
Additionally, the Internet’s goal is universal connectedness. This must be balanced against the business requirements of stability and operational manageability, which are made easier by network isolation. Each network operator must strike a delicate balance between maximizing customer connectedness and ensuring it can continue to provide a robust and reliable service. This balance is unique to each company’s situation and directly reflects the requirements and wishes of its customers plus the needs of the industry as a whole.
Category 2: Physical Damage
Fiber cut affects the West Coast, January 9, 2006
Around midday on Monday, January 9, 2006, tens of thousands of wireless, long-distance telephone and Internet customers along the West Coast found themselves without service after a fiber-optic cable was cut near Phoenix, Arizona. Contractors were digging to install cable TV in a rural area when their backhoe unexpectedly struck and severed a buried fiber-optic cable.
Ordinarily a single fiber cut would not create a service outage, since most fiber-optic networks are built in a self-healing ring topology that provides a back-up path and guarantees near-instant service restoration should a segment fail or be cut. This unusual confluence of events for West Coast customers began a few days earlier, when a storm-driven mud slide damaged the fiber on a ring segment almost 45 miles northwest of Reno, Nevada, placing the network into an unprotected state while all traffic was rerouted south through the Phoenix path.
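The difference between one cut and two can be checked with a small graph model. The sketch below is an illustration under stated assumptions: the five sites labeled A through E are hypothetical and do not represent the actual West Coast ring, and the function names are invented for this example.

```python
from collections import deque


def ring_links(nodes):
    """Build the set of links joining each node to its neighbor in a ring."""
    count = len(nodes)
    return {frozenset((nodes[i], nodes[(i + 1) % count])) for i in range(count)}


def reachable(nodes, links, start):
    """Breadth-first search: return every node reachable from start."""
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for other in nodes:
            if other not in seen and frozenset((current, other)) in links:
                seen.add(other)
                queue.append(other)
    return seen


def is_connected(nodes, links):
    """Return True if every node can still reach every other node."""
    return reachable(nodes, links, nodes[0]) == set(nodes)


# Hypothetical five-site ring: A-B, B-C, C-D, D-E, E-A.
sites = ["A", "B", "C", "D", "E"]
ring = ring_links(sites)

one_cut = ring - {frozenset(("A", "B"))}       # first segment damaged
two_cuts = one_cut - {frozenset(("C", "D"))}   # second cut while unprotected

print("intact ring connected:", is_connected(sites, ring))
print("after one cut:        ", is_connected(sites, one_cut))
print("after two cuts:       ", is_connected(sites, two_cuts))
```

With one segment out, traffic can still travel the long way around the ring. The second cut splits the sites into two groups that cannot reach each other, which is the unprotected state described above.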
Unfortunately for all affected, the January outage was a case of physical damage made worse by very bad luck. According to the Common Ground Alliance, an industry group of utilities and construction companies, over 185,000 excavation-related accidents occurred in 2004 where underground telecommunications cables or fiber were partially damaged or severed.
Most states have laws requiring a facilities inspection request up to two days prior to any digging or excavation work. Such One-Call requirements have increased awareness of buried facilities and reduced, but not eliminated, the number of incidents. The good news is that most incidents that do occur affect only local facilities, and the traffic is either re-routed over back-up paths or the damage does not cause a widespread outage. However, when a dual fiber cut does occur it can cause a severe outage and require extensive repair, affecting more customers and lasting longer than other types of incidents.
Nationwide, fiber paths cross local, county and state boundaries multiple times as they reach across the country. A single state’s One-Call number and its facilities inspection teams may be insufficient for work in border areas, and the arrangement can sometimes be confusing for contractors. To simplify the situation, in 2007 the Federal Communications Commission created 811: a nationwide call-before-you-dig phone number clearing house for 50 individual state programs. After almost a year of coordination, 811 was formally launched May 1, 2007 by the Common Ground Alliance with broad support from US industry.
While 811 won’t completely eliminate excavation-related outages, the hope is to provide a simpler method to “know what’s below” and reduce the number of preventable mistakes. Additionally, network operators must continually re-examine and optimize their physical infrastructure to manage growth, ensure sufficient redundancy, and maximize service availability.
Taiwan Submarine Earthquake, December 26, 2006
On Tuesday, December 26, 2006 a powerful 7.1 magnitude earthquake 15 kilometers south of Taiwan triggered undersea avalanches and damaged an unprecedented seven undersea communications cables. Internet, telephone and television services between China, Taiwan, Japan, Hong Kong, Korea, Singapore and their global trading partners the United States and Europe were affected.
Never before had so many independent undersea cable systems been damaged simultaneously: of the nine cables that pass through the Luzon Strait between Taiwan and the Philippines, only two cables remained in service.
Some observers incorrectly assumed the service interruptions were the result of a lack of investment or insufficient network capacity. However, regional and international demand was large enough to support nine independent cable systems and upgrades to three in the previous year to accommodate growing traffic volumes.
Services quickly returned to normal in the following weeks as regional traffic was gradually re-routed, sometimes around the world through Europe and North America, and the undamaged cable systems were reconfigured to accommodate the additional emergency traffic while those damaged systems were repaired.
In the case of a physical network break, route diversity is always the best solution. Internationally, the United States can reach its trading partners in Asia by sending traffic west across the Pacific or east through Europe. Domestically, because of healthy investments by the Internet industry, the United States enjoys a good amount of inter-city and inter-state route diversity. While cables may occasionally be cut or damaged, it’s rare that they cause widespread outages.
Category 3: Sabotage
The Code-Red Worm, July 12, 2001
On Thursday, July 19, 2001, more than 359,000 computers connected to the Internet were infected with the Code-Red (CRv2) worm in less than 14 hours. The cost of this digital epidemic in terms of lost productivity and corrupted data has been estimated to be in excess of 2.6 billion dollars.
A first strain of the Code-Red worm began to infect computers on July 12th, 2001. Approximately one week later, on the morning of July 19th, 2001, a new, more potent variant of the Code-Red worm (CRv2) appeared and began to spread. This second version contained a more aggressive and efficient propagation algorithm than the first and spread much more rapidly.
The Code-Red worm exploited a security vulnerability in Microsoft web servers and presented a serious security risk to the data on an infected system. Because it propagated so rapidly, the worm surprised many Internet service providers when it caused unexpected and widely experienced Internet congestion.
At its peak the Code-Red worm infected over 2,000 hosts per minute. Some researchers believe this incredible spread was limited only by the active intervention of computer system administrators, the implementation of traffic filtering and dampening by network service providers, and actual Internet congestion brought on by the worm’s own unquenchable consumption of resources.
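The shape of such an outbreak can be approximated with a simple susceptible-infected (SI) epidemic model, in which each infected host finds new victims at a rate proportional to the share of hosts still uninfected. The sketch below is a generic textbook model, not the analysis used in the cited studies. The population size, contact rate and time step are assumed values chosen only to show the curve's shape.

```python
def simulate_si(population, initial_infected, contact_rate, steps):
    """Discrete-time SI model: return the infected count after each step."""
    infected = float(initial_infected)
    history = [infected]
    for _ in range(steps):
        susceptible_fraction = (population - infected) / population
        new_infections = contact_rate * infected * susceptible_fraction
        infected = min(population, infected + new_infections)
        history.append(infected)
    return history


# Assumed values for illustration only: 100,000 vulnerable hosts, one initial
# infection, a 3% per-minute contact rate, and 840 one-minute steps (14 hours).
POPULATION = 100_000
history = simulate_si(POPULATION, initial_infected=1, contact_rate=0.03, steps=840)

# New infections per step, and the step at which the spread was fastest.
per_step = [later - earlier for earlier, later in zip(history, history[1:])]
peak_step = per_step.index(max(per_step))

print("fastest spread at minute:", peak_step)
print("share infected at that point: %.2f" % (history[peak_step] / POPULATION))
print("share infected at the end:    %.2f" % (history[-1] / POPULATION))
```

In this model the spread is fastest when roughly half of the vulnerable hosts are infected and then slows as the worm runs out of new targets. Patching, filtering and congestion would each lower the effective contact rate, flattening the curve.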
The Code-Red worm was the first worm to affect nearly everyone on the Internet, either from actual infection or collaterally via congestion. It was eventually defeated by a combination of software patches released by Microsoft and the implementation of dynamic network countermeasures.
As a result of the Code-Red incident, network service providers now understand that every device connected to the Internet is a part of the Internet. Computer software and security vendors now play an active, collaborative role with network service providers to help to secure connected devices from worms, viruses or attack. The Internet industry as a whole must also continue to research and develop new network monitoring and management techniques and technologies to minimize the spread and impact of malicious software programs when they are released.
Lessons Learned
The Internet’s dynamism is its principal strength. As consumers and businesses continue to invent new and more exciting ways to use the Internet, industry also continues to evolve with a similar dynamic and constructive approach to accommodate ever growing demands and system complexity.
After examining several categories and examples of Internet failures from a historical perspective, it is clear that network design, rapid development of new technologies and dynamic, flexible management have allowed the Internet to continue to operate even during catastrophic events, whether precipitated by human error, natural disaster or malicious intent.
The success of the Internet is partly due to its technological foundation and partly due to the freedom the industry has had to innovate. This freedom has not only spurred rapid capacity growth but also led to the invention of new technologies and the creation of entirely new business sectors focused on meeting new requirements. Yet it hasn’t been a free ride, and several important lessons can be learned from these historical events to help the Internet grow into the integrated digital communications network of the future.
Collaboration: Policy makers can greatly assist in creating a more secure Internet by fostering increased collaboration between the network service provider, computer software and computer security industries. Such collaboration will generate new technologies, procedures and relationships. All will be vital in the near future as network capacities, connectedness and our expectations for the Internet grow.
Incentive: Diversity is the best assurance for network survivability, and regional network operators should be encouraged to carefully review their physical plant and direct new construction efforts with diversity in mind. Policy makers at the federal, state and local level can all help to assure robust Internet service for their local constituents by supporting call-before-you-dig service messages and providing incentives for investment in physical diversity where possible.
Flexibility: Network service providers need the flexibility to dynamically manage their network resources, sometimes on a split-second basis, to counter security threats, minimize the impact of human errors, and ensure customers’ services remain intact. In some extreme cases service restoration may require restricting non-essential applications, for example preserving essential voice and emergency television while attenuating web traffic. Regulating how networks are managed will slow operators’ ability to respond to service-affecting events and discourage the active field of research into network operational improvement.
The Internet has proven to be extremely resilient when faced with tremendous pressures. It has survived hurricanes, earthquakes, tunnel fires, and terrorist attacks with only temporary and partial loss of end-to-end connectivity. At the same time, other, seemingly trivial events like configuration errors, construction mishaps or actual malicious intent have had dramatic impacts on Internet performance and consumer services. In 2001, the Code Red worm caused more widespread Internet congestion and outages than the September 11th terrorist attacks.
While the Internet industry has successfully evolved to accommodate millions of customers, meet their demand and satisfy their business requirements, its growth in the future remains uncertain. Such growth will require massive investments and the invention of many new technologies, practices and products. Policy makers can help create a fertile environment for the future Internet with a combination of policy, incentive and active support.
The Internet of the future will be bigger, faster and more far-reaching than even the most ambitious Internet engineer could have foreseen in the early 1980s, when Arpanet was recast. To continually manage and upgrade their operations, Internet engineers, operators and inventors need a great degree of flexibility to determine the business practices that work best for them and their customers. Policy makers can also continue fostering research, development and improvement of the Internet so that service providers and other infrastructure companies will continue to invest in new technologies and implement business models that will address these issues.
Citations and References
- “Ensure the reliability, security and performance of your network”, 2006 Sprint company white paper
- “Feasibility of IP Restoration in a Tier 1 Backbone”, Gianluca Iannaccone et al., IEEE Network, March/April 2004
- “The Uncleanliness Vector: Histories of Hostile Activity”, Michael Collins et al., Computer Emergency Response Center, Carnegie Mellon University
- “The (un)Economic Internet?”, Scott Bradner, kc claffy, IEEE Computer Society, 2007
- “A Study of Settlement Peering”, Aaron Quinn, Qwest Data Planning & Engineering, Internal
- “Sprint Global Quality of Service: Guarantor of Application Delivery”, 2006 Sprint company white paper
- “A Review of Fault Management in WDM Mesh Networks: Basic Concepts and Research Challenges”, Jing Zhang et al., IEEE Network, March/April 2004
- “Service Availability in IP Networks”, Christophe Diot et al., Sprint Advanced Technologies Laboratory
- “The Internet’s Not a Big Truck: Toward Quantifying Network Neutrality”, Robert Beverly et al., MIT CSAIL
- H.R. 5417 Congressional Budget Office Cost Estimate, June 7, 2006
- “A Brief History of the Internet”, Walt Howe, 2007, http://www.walthowe.com/navnet/history.html
- “The Design Philosophy of the DARPA Internet Protocols”, Dave Clark, Computer Communications Review, 1988
- “Analysis of BGP Prefix Origins During Google’s May 2005 Outage”, Tao Wan et al., School of Computer Science, Carleton University
- “Detecting BGP Configuration Faults with Static Analysis”, Nick Feamster et al., Computer Science and Artificial Intelligence Lab, MIT
- “Failures in an Operational IP Backbone Network”, Athina Markopoulou et al., Sprint Advanced Technologies Laboratory
- “The Backhoe: A Real Cyberthreat”, Kevin Poulsen, Wired Magazine, 01/19/2006
- “Increasing the Robustness of IP Backbones in the Absence of Optical Level Protection”, F. Giroire et al., Sprint Advanced Technologies Laboratory
- “Code-Red: a case study on the spread and victims of an Internet worm”, David Moore et al., CAIDA, San Diego Supercomputer Center
- “Code Red Worm Propagation Modeling and Analysis”, Cliff Changchun Zou et al., University of Massachusetts at Amherst, 2002