Data Center Ethernet (or DCB or CEE, depending on who you are) is a hot story these days and it’s no wonder that misconceptions abound. However, when I hear several CCIEs I highly respect claim that “Priority Flow Control can be used to stop all the other traffic when storage needs more bandwidth”, I get worried. Exactly the opposite is true: you use PFC to stop the overzealous storage traffic (primarily FCoE, but also iSCSI) to make sure you don’t drop it.
The basics
If you want to give some traffic precedence, you can do it with the existing QoS mechanisms. Let’s say you have a server that sends application (TCP/IP) and storage (FCoE) traffic over the same link; application traffic goes toward the network core and the FCoE traffic goes toward a storage array.
If you want to ensure the servers can send all the FCoE traffic they want, just configure strict priority queuing on all outbound ports (marked with QoS in the diagram). Brad Hedlund would be quick to point out that some server vendors do have certain limitations, but those aside, things should work just fine.
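To make the strict priority idea concrete, here is a minimal sketch of the dequeue decision an output port makes. The function name, the queue names and the frame labels are made up for illustration; real switches do this in hardware and with more than two queues.

```python
from collections import deque


def strict_priority_dequeue(queues):
    """Return the next frame from the highest-priority non-empty queue.

    `queues` is a list of deques ordered from highest to lowest priority.
    Returns None when every queue is empty.
    """
    for q in queues:
        if q:
            return q.popleft()
    return None


# Hypothetical traffic: the FCoE queue is listed first, so it always wins.
fcoe = deque(["fcoe-1", "fcoe-2"])
tcp = deque(["tcp-1", "tcp-2"])

order = []
while (frame := strict_priority_dequeue([fcoe, tcp])) is not None:
    order.append(frame)

print(order)
```

As long as the FCoE queue has something to send, TCP/IP frames wait. Nobody had to be paused to get that behavior.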
The demon of oversubscription
The first glitch we hit in our rosy FCoE-has-priority world is oversubscription. Unlike the way storage vendors are supposed to be building their Fibre Channel switches (off-topic: it’s not as bad as Greg’s rant would make it seem), we usually use oversubscription when designing Ethernet-based LAN networks: if two servers connect to a switch with 10GE links, we wouldn’t necessarily use two 10GE uplinks. In our scenario, if the servers engage in a SAN fiesta, the uplink will experience congestion … and if you don’t want to lose any FCoE traffic, you have to tell the servers to stop. Note that even though FCoE has priority, we still had to stop FCoE traffic, not TCP/IP traffic.
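The arithmetic behind that statement is trivial, but worth spelling out. The sketch below assumes the two 10GE servers share a single 10GE uplink (my assumption for the example; the function names are hypothetical as well).

```python
def oversubscription_ratio(server_rates_gbps, uplink_gbps):
    """Ratio of total server-facing bandwidth to uplink bandwidth."""
    return sum(server_rates_gbps) / uplink_gbps


def uplink_excess_gbps(offered_gbps, uplink_gbps):
    """Traffic the uplink cannot carry: it is either dropped or paused."""
    return max(0, sum(offered_gbps) - uplink_gbps)


# Two servers with 10GE links, one 10GE uplink (assumed).
ratio = oversubscription_ratio([10, 10], 10)

# Both servers send FCoE at line rate: the SAN fiesta.
excess = uplink_excess_gbps([10, 10], 10)

print(ratio, excess)
```

With both servers sending at line rate, 10 Gbps of traffic has nowhere to go. Queuing priority decides which traffic goes out first; it does not create bandwidth, so lossless FCoE has to be paused at the source.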
Getting down to Earth
Most data centers serve the end users, so TCP/IP traffic is at least marginally important and you’d want to give it a fair share of bandwidth. No problem, there’s ETS, ratifying the Weighted Deficit Round Robin (WDRR) queuing we’ve been using for a decade.
Yet again, if the amount of FCoE traffic is low, there’s no need to worry: WDRR will give it all the bandwidth it needs. When the volume of FCoE traffic increases, we could either push other traffic aside (bringing us back to the FCoE-has-strict-priority scenario we’ve just discussed) or tell the FCoE traffic to stop to prevent packet loss.
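For readers who have never looked inside a deficit round robin scheduler, here is a simplified sketch. The quanta, frame sizes and queue names are illustrative assumptions, not values from any standard or product, and a real implementation is driven by link time rather than by a fixed number of rounds.

```python
from collections import deque


def wdrr_schedule(queues, quanta, rounds):
    """Run a simplified weighted deficit round robin.

    queues: dict of class name -> deque of frame sizes in bytes
    quanta: dict of class name -> bytes of credit added per round
    Returns a dict of class name -> bytes sent.
    """
    deficit = {name: 0 for name in queues}
    sent = {name: 0 for name in queues}
    for _ in range(rounds):
        for name, q in queues.items():
            if not q:
                deficit[name] = 0  # idle classes do not hoard credit
                continue
            deficit[name] += quanta[name]
            # Send frames while the head-of-line frame fits the credit.
            while q and q[0] <= deficit[name]:
                size = q.popleft()
                deficit[name] -= size
                sent[name] += size
            if not q:
                deficit[name] = 0
    return sent


quanta = {"fcoe": 6000, "tcp": 4000}  # assumed 60/40 split

# Both classes congested: each gets its configured share.
busy = {"fcoe": deque([1000] * 100), "tcp": deque([1000] * 100)}
busy_sent = wdrr_schedule(busy, quanta, rounds=10)

# Light FCoE load: FCoE sends everything, TCP/IP takes the rest.
light = {"fcoe": deque([1000] * 5), "tcp": deque([1000] * 100)}
light_sent = wdrr_schedule(light, quanta, rounds=10)

print(busy_sent, light_sent)
```

In the congested run the classes split the link 60/40. In the light run FCoE gets all it asked for without any help from PFC, which is the point of the paragraph above.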
Could stopping TCP/IP instead of FCoE help? Not really, we would achieve the same effect as making FCoE traffic higher priority (see above).
Would it help to be able to stop TCP/IP traffic if it congests the network? Sometimes. Making TCP/IP lossless increases the overall throughput (that’s the reason you should make iSCSI lossless), but it also messes up TCP’s congestion avoidance mechanism, which relies on occasional drops.
Dirty details
Let’s mention one more technical detail: implementing PFC is expensive. You have to send PAUSE frames on input ports when output ports become congested. Unless you want to block all input ports when a single output port is congested (not a good idea), you have to build a sophisticated queuing mechanism similar to Virtual Output Queues on the Nexus switches, and be able to detect looming congestion well in advance (due to round-trip times between switches, it takes a while after a PAUSE frame is sent before the inbound traffic actually stops). That’s the reason the Nexus 5000 supports only three PFC-enabled classes with strict MTU limitations in the current software.
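A back-of-the-envelope calculation shows why round-trip time and MTU matter. The sketch below is a deliberately simplified model of my own, not a vendor formula: it counts the bytes in flight on the cable during one round trip, plus one maximum-size frame the receiver might be transmitting ahead of the PAUSE frame and one maximum-size frame the sender might have just started. It ignores PAUSE processing delays, and the 5 ns/m propagation delay and the frame sizes are assumptions.

```python
def pfc_headroom_bytes(rate_bps, cable_m, max_frame_bytes, ns_per_m=5):
    """Rough buffer headroom needed above the PAUSE threshold.

    Simplified model (assumption): round-trip bytes in flight plus two
    maximum-size frames; PAUSE processing delays are ignored.
    """
    rtt_ns = 2 * cable_m * ns_per_m
    in_flight = rate_bps * rtt_ns // (8 * 10**9)
    return in_flight + 2 * max_frame_bytes


# 10GE over 100 m of cable, illustrative frame sizes.
small = pfc_headroom_bytes(10 * 10**9, 100, 2200)
jumbo = pfc_headroom_bytes(10 * 10**9, 100, 9216)

print(small, jumbo)
```

Even in this optimistic model a 9216-byte jumbo class needs more than three times the headroom of a class limited to roughly 2200-byte frames, and that headroom has to be reserved per port and per lossless class. Limiting the number of PFC-enabled classes and their MTU starts to make sense.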
For more details, see the Priority Flow Control: Build Reliable Layer 2 Infrastructure whitepaper from Cisco.
More information
- Introduction to 802.1Qbb (Priority-based Flow Control – PFC)
- Introduction to 802.1Qaz (Enhanced Transmission Selection — ETS)
- Data Center 3.0 for Networking Engineers webinar describes numerous Data Center technologies, including DCB, FCoE and iSCSI.