Amazon Web Services (AWS, AS16509) is the leading cloud infrastructure service provider, holding around one-third of the market. With more than one million customers, including many Fortune 500 companies, AWS treats the security of its network as a constant priority.
One fundamental measure it has taken over the past five years to secure its routing infrastructure is deploying Resource Public Key Infrastructure (RPKI), a method of cryptographically attesting that a network is authorized to originate routes to a specific set of network addresses.
Fredrik Korsbäck, Senior Network Infrastructure BD – IP & Edge at AWS, recently shared some lessons from that work in a Paths to MANRS webinar to help others with their own deployments and potentially become MANRS participants.
Deploying ROAs and ROV at the Same Time
“It’s important to note that RPKI is just one component of our routing security strategy, which has always been a high priority of our networking leadership and is always evolving,” said Fredrik. “Having this buy-in at a managerial level is crucial to have projects like these succeed.”
Deploying RPKI has two distinct stages:
- Signing your route information that you advertise in Border Gateway Protocol (BGP) by issuing Route Origin Authorizations (ROAs).
- Validating the cryptographic signatures of other networks’ route information via Route Origin Validation (ROV).
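As a rough illustration of the second stage, the sketch below classifies a route against a set of ROAs in the spirit of RFC 6811, returning valid, invalid, or not-found. The `Roa` class, the `validate_origin` function, and the example prefixes and AS numbers (taken from documentation ranges) are assumptions made for illustration, not AWS's implementation.

```python
from dataclasses import dataclass
from ipaddress import ip_network


@dataclass(frozen=True)
class Roa:
    prefix: str      # e.g. "192.0.2.0/24"
    max_length: int  # longest prefix length the ROA authorizes
    asn: int         # authorized origin AS (0 means "do not route")


def validate_origin(route_prefix: str, origin_asn: int, roas: list[Roa]) -> str:
    """Classify a route as 'valid', 'invalid', or 'not-found'."""
    route = ip_network(route_prefix)
    covered = False
    for roa in roas:
        roa_net = ip_network(roa.prefix)
        # Skip ROAs of the other address family or that do not cover the route.
        if route.version != roa_net.version or not route.subnet_of(roa_net):
            continue
        covered = True  # at least one ROA covers this route
        # An AS0 ROA can cover a route but never makes it valid.
        if roa.asn != 0 and roa.asn == origin_asn and route.prefixlen <= roa.max_length:
            return "valid"
    return "invalid" if covered else "not-found"


# Hypothetical example data using documentation prefixes and ASNs.
roas = [Roa("192.0.2.0/24", 24, 64496)]
print(validate_origin("192.0.2.0/24", 64496, roas))     # matching origin and length
print(validate_origin("192.0.2.0/24", 64497, roas))     # covered, wrong origin
print(validate_origin("198.51.100.0/24", 64496, roas))  # no covering ROA
```

A network doing ROV at its border would typically drop routes classified as invalid and keep accepting not-found ones.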
End-type networks, such as enterprise companies, get the most benefit from signing ROAs for their routes so that their upstreams can validate their route announcements. For large carriers, such as NTT, Arelion, and Cogent, it’s more important to enact ROV, as they provide transit for most Internet traffic and can stop hijacked routes from being propagated.
“Because cloud networks bridge the two, it’s equally important they sign all their routes and drop all invalid routes at the border,” said Fredrik.
“That’s why we did both things at the same time. Even though they were done by different teams, and they are essentially different technologies deployed in different parts of the network, together they are pillars of our routing-security strategy.
“This required several teams to work in parallel with the network and software engineer teams, IP management team, and customer accounts among others.”
The whole reason AWS uses RPKI, said Fredrik, is to prevent its customers’ sites and services from being hijacked.
“Keeping our customers informed has been essential,” explained Fredrik. “Some of our very large customers questioned the value of us spending so much time deploying RPKI, instead of working on their requests. We explained that if we and others don’t stay ahead of the game and secure the fundamental protocol we use to connect to and route traffic through the Internet (BGP) then we leave ourselves open to route hijacks.”
Our RPKI Deployment is Probably the Largest in the World!
AWS has tens of millions of IP addresses allocated from all five Regional Internet Registries (RIRs), including some /10s, /11s, and /12s, as well as a bring-your-own-IP (BYOIP) service. To manage them, Fredrik explained, it has a “central vending machine” for IP addresses.
“If one of our teams needs to build a service or tool for a customer, it would request a network allocation from the vending machine, which is returned to the pool once the project is complete,” said Fredrik. “We handle the ROAs centrally and make sure they are up to date and published under our delegated RPKI Repository accordingly.”
The vending machine acts like an Internet registry: it keeps track of who is using which IP addresses and updates the ROAs for those prefixes. Because teams are slicing big allocations, new ROAs must be created for each new allocation rather than relying on the ROAs that cover AWS’s larger allocations.
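To make the idea concrete, here is a toy sketch of an allocator that hands out slices of a parent block and keeps one ROA record per slice. Everything in it is hypothetical: the class name, the fixed slice size, the tuple layout of a ROA record, and the choice to withdraw a ROA when a slice is returned are assumptions for illustration and do not describe AWS's actual system.

```python
from ipaddress import ip_network


class PrefixVendingMachine:
    """Toy allocator: hands out fixed-size slices of one parent block."""

    def __init__(self, parent: str, slice_len: int, asn: int):
        self.asn = asn
        self.free = list(ip_network(parent).subnets(new_prefix=slice_len))
        self.owners = {}  # allocated prefix -> who is using it
        self.roas = {}    # allocated prefix -> (prefix, maxLength, origin ASN)

    def allocate(self, owner: str):
        prefix = self.free.pop(0)
        self.owners[prefix] = owner
        # Each slice gets its own ROA, with maxLength equal to the slice length.
        self.roas[prefix] = (str(prefix), prefix.prefixlen, self.asn)
        return prefix

    def release(self, prefix):
        # Assumption: returning a slice to the pool also withdraws its ROA.
        del self.owners[prefix]
        del self.roas[prefix]
        self.free.append(prefix)


# Hypothetical usage with a documentation prefix and ASN.
vm = PrefixVendingMachine("198.51.100.0/24", slice_len=26, asn=64496)
slice_a = vm.allocate("team-a")
print(vm.roas[slice_a])
vm.release(slice_a)
```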
“Initially, this task of creating ROAs and updating them in the relevant RIR registries was manually done by a person named Jon,” said Fredrik. “As you can imagine this became unsustainable, so we created a program to automate the process, which we named AutoJon.
“However, AutoJon is itself being made redundant as we’re moving everything into delegated RPKI. This will help us overcome problems we’ve had over the years given the size of our publication and, to some extent, the speed at which we publish ROAs. We publish and delete ROAs too fast, which essentially puts a strain on the RIR system and has taken it down more than once.”
Having its own delegation system will also allow AWS to explore building a distributed rsync service — it has experience running several Certificate Authorities (CAs) for webPKI.
Dealing with BYOIP
Fredrik explained that “if you bring your own IP, AWS enforce the use of RPKI ROAs — you cannot enroll IP space that is yours in AWS in the cloud, or wherever you want to put it, without ROAs.”
There are two stages in enrolling IP space with AWS:
- Validating whether AS16509 is allowed to announce your IP prefix. As part of this, AWS checks if there’s already a ROA created for it. If yes, it can announce it. If not, it must have the customer create a ROA for it.
- Assigning the prefix to an AWS account. This indicates who controls this prefix and logs an X.509 certificate for that prefix in the routing records.
“We don’t like that solution but it’s the only way we’ve been able to do it,” said Fredrik.
“We have been supporting new RFC drafts such as RPKI Signed Checklists (RSCs). As soon as we see support coming in from at least RIPE NCC and ARIN — around 95% of our BYOIP customers come from these two RIRs — we will start using RSCs as they allow us to insert AWS user accounts in the files so they can cryptographically publish these and we can verify them with the help of the RPKI ecosystem. We’re looking forward to that.”
We Have a Little Bit of Inbuilt Slowness in the System
Like everyone who deploys something new, AWS was wary of what happens if something breaks. One example Fredrik mentioned was a large number of ROAs being published with AS0 as the origin, which tells validating networks that the covered prefixes should not be routed at all. “That’s why we have a lot of what we call velocity checks,” said Fredrik.
“We also didn’t want to see examples where there was a massive publication or unpublication of ROAs. That shouldn’t happen, so we have an anomaly detector in place to check for these kinds of things, which will postpone publication until a human has looked at it. So, we have a little bit of inbuilt slowness in the system.”
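A velocity check of this kind could look something like the sketch below, which compares the currently published ROA set with a proposed one and flags the change for human review if it is unusually large or adds many AS0 ROAs. The function name, the `(prefix, max_length, asn)` tuple layout, and both thresholds are hypothetical; the source does not describe how AWS's detector actually works.

```python
def review_needed(current: set, proposed: set,
                  max_change_ratio: float = 0.05,
                  max_new_as0: int = 10) -> bool:
    """Return True if a proposed ROA set should wait for a human.

    ROAs are modeled as (prefix, max_length, asn) tuples.
    """
    added = proposed - current
    removed = current - proposed
    churn = len(added) + len(removed)
    baseline = max(len(current), 1)  # avoid dividing by zero on first publish
    new_as0 = sum(1 for (_prefix, _max_len, asn) in added if asn == 0)
    return churn / baseline > max_change_ratio or new_as0 > max_new_as0


# Hypothetical data: 100 published ROAs, then two proposed changes.
current = {(f"10.0.{i}.0/24", 24, 64496) for i in range(100)}
one_more = current | {("10.1.0.0/24", 24, 64496)}
print(review_needed(current, one_more))  # small change
print(review_needed(current, set()))     # mass unpublication
```

Holding publication only when a threshold is crossed keeps routine updates automatic while still adding the deliberate slowness described above for anomalous batches.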
Use the Expertise You Have at Hand
The first lesson Fredrik shares with people interested in deploying RPKI is that no two deployments are the same.
“When I started at AWS, I asked if we could get this going in two weeks, which is how long it took at previous companies,” recollected Fredrik. “But when you’re working at one of the largest companies in the world, it turns out that’s simply not doable as I’ve outlined above.
“For example, most of the people who have built AWS’s system had never heard about our RPKI before it was put in front of them. But they are good software developers and infrastructure people. So, they just look at this as any other type of computer plumbing system. It doesn’t matter if it’s a version control system or a global RPKI: you need to consider what the failure modes are, what happens if this breaks, and what happens if it breaks at the same time as something else breaks.
“Looking at it from a holistic point of view, I think, has been, at least for me, extremely valuable.”
Routing Security Requires You to Always Look Ahead
Fredrik acknowledged that it’s been great to see the growth of RPKI deployment over the last five years and the maturation of the routing security ecosystem that it sits in and supports. An important development he sees as part of this is the evolution of RIRs from IP and ASN bookkeepers into a part of the Internet’s critical infrastructure.
“One supporting system that I feel needs further attention is the Internet Routing Registry (IRR), specifically how we as a community handle fake IRR objects,” said Fredrik. “I was happy to see the change made last year in the RIPE database working group on how to handle AS-Sets.”
In terms of technology, two routing security concepts coming through the standards processes that have Fredrik and others at AWS excited are:
- Autonomous System Provider Authorization (ASPA) objects, in which a network states which provider ASNs are allowed to propagate its routes. From a technical view, these objects will help AWS overcome some parts of the origin or path validation problems it has today. However, from a business perspective, they are quite contentious as you essentially disclose the business relationships you have with certain ISPs. “We absolutely believe in ASPA and would be one of the first to publish them, but we need to express what relationship we have with certain ISPs, and this is not always clear-cut,” clarified Fredrik.
- Border Gateway Protocol Security (BGPsec), which is about path validation: the signatures carried along the BGP path let you know for certain that the message was passed from AS to AS, as is. Fredrik said AWS is still exploring the merits of BGPsec, which he feels the industry needs to view the way it viewed RPKI in 2016, when failing open made it more appealing and achievable. “It is cryptographically heavy, and it is hard to implement and most likely unachievable to get it deployed Internet-wide, but it could be a much-needed upgrade for certain interconnect relationships with able networks,” explained Fredrik.
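As a rough illustration of the ASPA idea in the list above, the sketch below checks a single hop: given a customer AS and a neighboring AS, is that neighbor one of the providers the customer has attested? It loosely follows the hop-check idea in the IETF ASPA verification draft, and full path verification is considerably more involved. The function name, the dictionary layout, and the AS numbers are assumptions for illustration.

```python
def hop_check(customer: int, provider: int, aspas: dict[int, set[int]]) -> str:
    """Is `provider` an attested provider of `customer`?

    `aspas` maps a customer ASN to the set of provider ASNs it has published.
    """
    providers = aspas.get(customer)
    if providers is None:
        return "No Attestation"  # the customer has published no ASPA
    return "Provider+" if provider in providers else "Not Provider+"


# Hypothetical data using documentation ASNs.
aspas = {64496: {64500, 64501}}
print(hop_check(64496, 64500, aspas))  # attested provider
print(hop_check(64496, 64502, aspas))  # ASPA exists, but AS64502 is not listed
print(hop_check(64499, 64500, aspas))  # AS64499 has published nothing
```

The provider set is exactly the business relationship that publishing an ASPA discloses, which is why Fredrik describes it as contentious.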
Learn more about how AWS is helping to secure Internet routing.