This article is a sequel to “Catchpoint Excels at Internet Resilience”, which covered Catchpoint and its drive to be the best at Internet Resilience. That article briefly discussed what that effort entails, and then walked through a real-world example of troubleshooting slowness in an Internet-based application.
There is something major that was deliberately missing from the prior article: BGP, and BGP monitoring. Which just happens to be the topic of this article.
Goals for this article:
- Explain everything about BGP in one paragraph (hardly – there are books doing that)! Cover enough basics so someone new or just learning BGP can (I hope!) follow the rest of this article.
- Show a sequence of screens from Catchpoint, to give you a feel for how it shows what it has learned about BGP, and how it might be useful to you in troubleshooting BGP and problems with your organization’s Internet presence or access to SaaS/Internet-based apps. If you’ve ever had the pleasure of trying to diagnose BGP from the command line, having the data already available and captured over time is far better!
Very Basic Intro to BGP
BGP is a routing protocol with lots of controls. It is used between different organizations on the Internet, and may also be used internally by large organizations.
BGP routing involves establishing routing peers, which are manually configured (for security and trust reasons). The peers exchange selected routing prefixes. The prefixes may be summarized (many rolled up into a summary), and filtered (being selective about which prefixes a router tells its neighbor about).
Each separate BGP organization has an assigned Autonomous System Number (“ASN”), which allows tracking which autonomous system (AS) first advertised (originated) a prefix. Each route accumulates the ASNs of the autonomous systems it passes through (the “AS path”), so BGP also tells you which organizations passed along a given routing prefix.
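To make the ASN accumulation concrete, here is a minimal Python sketch (3.9+). It is an illustration only, not how any router or Catchpoint implements it. The `Route` class and the function names are hypothetical, and the ASNs and prefix come from the ranges reserved for documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    """Hypothetical, simplified view of one BGP route."""
    prefix: str
    as_path: tuple[int, ...] = ()  # leftmost = most recent AS, rightmost = origin


def export_to_external_peer(route: Route, local_asn: int) -> Route:
    """An AS prepends its own ASN when advertising a route to a neighbor AS."""
    return Route(route.prefix, (local_asn,) + route.as_path)


def origin_as(route: Route) -> int:
    """The rightmost ASN in the path is the AS that originated the prefix."""
    return route.as_path[-1]


if __name__ == "__main__":
    # AS 64500 originates the prefix, then AS 64501 and AS 64502 pass it along.
    route = Route("192.0.2.0/24")
    for asn in (64500, 64501, 64502):
        route = export_to_external_peer(route, asn)
    print(route.as_path)     # ASes traversed, most recent first
    print(origin_as(route))  # the originating AS
```

Reading the path right to left gives the order in which the organizations handled the prefix, starting with the originator.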
BGP is a great protocol for handling large numbers of routing prefixes. It used to be simple but is rapidly acquiring added features, which increases complexity.
Where necessary, BGP can handle the hundreds of thousands of prefixes routed by an ISP. Best practice is to not advertise little subnets of a given network, but summarize prefixes. You can think of it as outside the U.S. having one route to “Company X, U.S. Division”, rather than tracking how to get to each building. Inside the U.S., more specific details would be routed.
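As a small illustration of summarization, the sketch below uses Python's standard `ipaddress` module to roll four adjacent "building" subnets up into one prefix. The subnets are made-up private addresses, and `summarize` is a hypothetical helper. Real-world aggregation is a routing policy decision and may advertise a covering prefix even when the pieces do not merge this neatly.

```python
import ipaddress


def summarize(prefixes):
    """Collapse adjacent or overlapping prefixes into the fewest networks."""
    networks = [ipaddress.ip_network(p) for p in prefixes]
    return [str(n) for n in ipaddress.collapse_addresses(networks)]


if __name__ == "__main__":
    # Four hypothetical per-building subnets inside one company.
    buildings = ["10.1.0.0/24", "10.1.1.0/24", "10.1.2.0/24", "10.1.3.0/24"]
    print(summarize(buildings))  # one /22 instead of four /24s
```

The outside world then needs a single route for the whole block, while the more specific /24s stay internal.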
Time to change the subject before I dig the technical hole deeper here!
Why Do We Care About BGP?
Well, BGP is the routing used by the Internet.
When things go wrong, traffic from some sites cannot reach certain sites, or is intermittent or slow, or even redirected to a hacker’s site.
So BGP problems cost money (lost use of Internet/SaaS apps, or lost customer access to your business website). There is also risk of security issues, etc. Furthermore, optimizing BGP routing can make sites appear faster, saving employee or customer time and providing a better application experience.
Another perspective is that modern applications, SaaS and other, can go all over the place pulling in chunks of web pages. Some of those chunks may be slow to load, impacting overall performance and perception (“What a slow website! Losers! I’ll shop elsewhere…”). Gotta be fast!
So slow or broken Internet routing, especially as caused by BGP problems, costs an organization time, money to fix, and perception/reputation, which takes lots of money to fix.
You don’t want to be known as the company with the flaky e-commerce site. You also don’t want many employees unable to use, or experiencing significant slowness accessing, a SaaS application they need to do their jobs, with the ensuing low productivity.
Potential BGP Problems
So: what could possibly go wrong with BGP?
Here is a list of some/most of the major things that might go wrong and should be monitored. For more details, see Catchpoint’s page on BGP Monitoring.
- Loss of BGP peer/peer not coming up. (See also BGP Troubleshooting Cheat Sheet: https://www.catchpoint.com/bgp-monitoring/bgp-troubleshooting-cheat-sheet — what to check when a BGP session to a peer is down.)
- Performance and instability: When BGP to a peer gets slow, unreliable, or unstable, that can cause routing problems.
- Route leaks: Something is advertising routes that it should not be advertising, perhaps not properly summarizing prefixes.
- Route hijacking: A site claims to be the origin of some prefix(es), advertising them with its own AS number as source. Your traffic is going to the wrong place (security alert)!
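The hijack case in the list above can be sketched as a simple origin check: compare the origin AS of each observed announcement with the AS you expect to originate your prefix. This is a simplified illustration, not Catchpoint's detection logic. The `EXPECTED_ORIGINS` table, the function name, and the result labels are all hypothetical, and the prefixes and ASNs are documentation values.

```python
import ipaddress

# Hypothetical table of prefixes we own and the AS that should originate them.
EXPECTED_ORIGINS = {"203.0.113.0/24": 64500}


def classify_announcement(prefix, as_path, expected=EXPECTED_ORIGINS):
    """Label an announcement by comparing its origin AS with the expected one."""
    announced = ipaddress.ip_network(prefix)
    origin = as_path[-1]  # rightmost ASN is the originator
    for known, expected_origin in expected.items():
        known_net = ipaddress.ip_network(known)
        if announced.version != known_net.version:
            continue  # subnet_of() raises TypeError across IPv4/IPv6
        if announced.subnet_of(known_net):
            if origin != expected_origin:
                return "possible hijack"
            if announced != known_net:
                return "unexpected more-specific"
            return "ok"
    return "not monitored"


if __name__ == "__main__":
    print(classify_announcement("203.0.113.0/24", [64510, 64500]))
    print(classify_announcement("203.0.113.0/24", [64510, 64511]))
    print(classify_announcement("203.0.113.128/25", [64510, 64500]))
```

A more-specific announcement is flagged separately because longer prefixes win in routing, so even a correctly originated one deserves a look if you did not intend to advertise it.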
Catchpoint’s BGP Monitoring
How does Catchpoint monitor BGP?
Catchpoint has many route collectors peered with BGP routers at major ISPs, etc., and leverages public route collectors as well. See the graphic at the end of this article for numbers.
A BGP route collector receives advertised routing prefixes, but does not propagate them or advertise any prefixes of its own. The collectors also send beacons to check BGP convergence time – how fast BGP stabilizes after losing/gaining a prefix or peer(s).
This enables Catchpoint to track the following for a set of prefixes:
- Availability and downtime – can packets get to that destination?
- Withdrawn and restored routes – track which routes vanish and come back: which routes are no longer reachable, or are flapping (disappearing and re-appearing from BGP tables).
- BGP peering stability, up and down time, etc.
- Where the BGP path (“AS path”) takes packets, through which ISPs.
- Changes in BGP path over time.
- RPKI (Resource Public Key Infrastructure) data – cryptographically signed objects stating which AS is authorized to originate a given prefix. If RPKI validation is failing, then the affected advertised prefixes may be treated as untrusted and dropped by networks that enforce it.
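For the RPKI item above, the basic origin check can be sketched as follows. It follows the general valid / invalid / not-found logic of route origin validation (RFC 6811), heavily simplified: the `Roa` tuple and the sample data are hypothetical, and a real validator also verifies the signatures and certificate chain rather than trusting a plain list.

```python
import ipaddress
from typing import NamedTuple


class Roa(NamedTuple):
    """Simplified Route Origin Authorization: who may originate what."""
    prefix: str
    max_length: int  # longest prefix length the AS may announce
    asn: int


def validate_origin(prefix, origin_asn, roas):
    """Return 'valid', 'invalid', or 'not-found' for an announced prefix."""
    route = ipaddress.ip_network(prefix)
    covered = False
    for roa in roas:
        roa_net = ipaddress.ip_network(roa.prefix)
        if route.version != roa_net.version or not route.subnet_of(roa_net):
            continue  # this ROA does not cover the announced prefix
        covered = True
        if roa.asn == origin_asn and route.prefixlen <= roa.max_length:
            return "valid"
    # Covered by at least one ROA but none matched: invalid.
    return "invalid" if covered else "not-found"


if __name__ == "__main__":
    roas = [Roa("198.51.100.0/24", 24, 64500)]
    print(validate_origin("198.51.100.0/24", 64500, roas))
    print(validate_origin("198.51.100.0/24", 64501, roas))
    print(validate_origin("192.0.2.0/24", 64500, roas))
```

The "not-found" result corresponds to prefixes for which no authorization has been published at all.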
BGP Dashboard: What a Catchpoint User Sees
We’ll now take a quick tour of how Catchpoint displays the BGP data it has, via the BGP Dashboard. And how you can use it.
The starting point is the BGP Overview dashboard, shown below.
There is a lot of information packed into this screen.
At the top left is a widget tracking RPKI status for specific prefixes. The count of valid/invalid/not configured is shown.
The rest of the top row summarizes:
- Reachability. What percentage of peers are reachable.
- # Hijacks. How many prefixes have been hijacked. (Advertised by a malicious or inadvertent 3rd party, preventing some or all traffic from getting to where it should be going.)
- # Neighboring Peers: How many direct neighbors are up/active.
- # Prefixes Withdrawn: How many prefixes were withdrawn (were present and removed by a peer).
This provides a quick overview that highlights any of the simpler problems that might arise.
Each widget shows counts for Catchpoint private collectors, RIPE, and RouteViews public collectors.
The next part of the dashboard indicates data about neighbors and origins (autonomous systems). The right-most column indicates percent changes in number of peers, and number of route prefixes learned per peer.
Scrolling down, the next item shown is a summary graphic of tests and status, with a block per prefix. Green is good. This is another way to quickly spot problems (non-green!).
Below that is a world map view, which helps see the scope of any reachability problems (cities, countries, continents). That’s self-explanatory.
BGP Per-Prefix View
Clicking on a prefix or dot in the map brings up a per-prefix view, shown below. There is a lot of information packed into this view!
Along the top is a timeline, showing various status metrics (drop down menu) at various times. Above that are settings for what time period, etc. that you can change if needed.
Below that, the display shows specifics about the prefix and its reachability and stability in the routing table.
The bottom shows statistics over time. You can choose to see path changes or all paths. Dragging to select a time period gives a comparison below the chart.
Scrolling further down brings up what I call the “train tracks view”.
This shows the sequence of ASes (autonomous systems, sites) traversed to get from the collector to the chosen prefix. Prior paths are also shown in red.
You can easily select different time periods to see how the path has changed over time. For example:
If you like, you can click on the “Peers Table” tab (up top left) to see data per-peer for the selected time interval. This is shown below.
For example, selecting a 24-hour time interval and then this tab shows you a summary of that day.
BGP Events Screen
When troubleshooting or monitoring, it can be helpful to see what has been happening from a BGP protocol perspective: BGP events. This is the third tab (top left).
Sliding the gray box or dragging to select a time interval in the timeline easily brings up all BGP events for that time period.
An “announcement” is a BGP advertisement, triggered by some change somewhere. The table tells you the prefix involved, what the next hop and AS Path were, associated BGP communities (used for filtering prefixes), etc. In the graphic, note that the next hop changed in the fourth and fifth rows. That means there was some change “out in the Internet” and the collector then routes to the prefix via a different BGP neighbor (next hop).
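Spotting that kind of next-hop change by eye works for a handful of rows. For a longer event list, the same comparison can be sketched in a few lines of Python. The `Announcement` fields below are hypothetical and do not reflect Catchpoint's export format, and the addresses and ASNs are documentation values.

```python
from typing import NamedTuple


class Announcement(NamedTuple):
    """Hypothetical, simplified row from a BGP events table."""
    timestamp: str  # ISO 8601, so string order equals time order
    prefix: str
    next_hop: str
    as_path: tuple


def next_hop_changes(events):
    """List (timestamp, prefix, old_next_hop, new_next_hop) for each change."""
    last_seen = {}
    changes = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        previous = last_seen.get(ev.prefix)
        if previous is not None and previous != ev.next_hop:
            changes.append((ev.timestamp, ev.prefix, previous, ev.next_hop))
        last_seen[ev.prefix] = ev.next_hop
    return changes


if __name__ == "__main__":
    events = [
        Announcement("2024-01-01T10:00:00Z", "203.0.113.0/24", "192.0.2.1", (64501, 64500)),
        Announcement("2024-01-01T10:05:00Z", "203.0.113.0/24", "192.0.2.1", (64501, 64500)),
        Announcement("2024-01-01T10:09:00Z", "203.0.113.0/24", "192.0.2.9", (64502, 64500)),
    ]
    for change in next_hop_changes(events):
        print(change)
```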
Summary: Why Catchpoint?
We’ll wrap up with a slide from the video presentation. It pretty much explains itself. Note that Catchpoint has a large number of BGP peers, tracking 398 ASNs.
Also please note the “real-time” item, a unique Catchpoint feature. This refers to Catchpoint peers in key Internet locations that are tracked in real time. For external peers (RIPE, RouteViews), there is a delay because those peers have to scale to support many external entities.
Summing everything up, Catchpoint’s BGP dashboards provide actionable, timely, easy-to-understand information. The timelines provide the ability to easily see “what changed”, which can be painful to do any other way. The “train tracks” diagrams show AS path changes over time, and are more informative than trying to compare text-based AS path listings.
Conclusion
Catchpoint can monitor and display valuable BGP information over time, allowing staff to keep an eye on, spot problems in, and troubleshoot global or large-scale routing. To learn more about Catchpoint, you can watch their presentations from Networking Field Day or visit their website, where Catchpoint has just revised and updated a Comprehensive Guide to BGP. It is recommended as a learning tool for those new to BGP, and a good review for others. It expands on what was necessarily summarized above.