ACE and OTV and other layer 2 woes

So I have a client with two data centers. We’ve deployed OTV for VMotion, NFS, and all that jazz. They bought two ACEs, one for each data center. I thought, no big deal. So I rack them and throw a management IP on each. So far so good. Next I throw the ft (fault tolerant) VLAN into OTV. I test connectivity between the two ft IPs and they each respond. I’m thinking this is almost too easy... and I’m right, it is too easy. I get all the other fault tolerant settings configured on the backup, then the primary, and voila, I get COLD_STATE. I review the excellent troubleshooting guide here, but for some reason the HA pair will not come up. I bet I rebuilt that ft config 20 times. I read and re-read the configuration guide, the troubleshooting doc, everything. Then I was checking the logs and found something interesting:

%ACE-4-106023: Deny udp src vlan510: dst undetermined: by access-group “#FT_VLAN_ACL#4#” [0xffffffff, 0x0]

So I check the System Messages for the ACE, and the entry is pretty much worthless:

Explanation    An IP packet was denied by the ACL. This message displays even if you do not have the log option enabled for an ACL. If a packet hits an input ACL, the outgoing interface will not be known. In this case, the ACE prints the outgoing interface as undetermined. The source IP and destination IP addresses are the unmapped and mapped addresses for the input and output ACLs, respectively, when used with NAT.

Recommended Action    If messages persist from the same source address, messages might indicate a foot-printing or port-scanning attempt. Contact the remote host administrators.

The strange thing is that I have no ACLs configured, but the log shows one. Come to find out it’s one of those hidden system ACLs that protect us from ourselves. It’s filtering PIM on the ft VLAN. Cisco’s docs say that ft communication uses telnet between the two boxes on the ft VLAN IPs. It looks to me like it also uses some multicast. I didn’t have time to throw a sniffer on there, though, so I can’t see exactly what’s going on.

I did some more searching around and found on CSC (Cisco Support Community) that someone else was having a similar problem. A guy from Cisco replied with something very interesting. He basically stated that OTV was not developed for extending layer 2 services such as fault tolerance (ACE, ASA, etc.). He went on to say that OTV was designed for layer 2 “bulk workloads” such as VMotion. I assume we’re not supposed to use OTV for fault tolerance because its latency is unreliable and split-brain flapping would happen often.

Luckily I had a layer 2 connection between the sites. I added the ft VLAN to the L2 trunk and removed it from OTV. The two boxes HA paired quickly and I was in business. Unfortunately it took about 7 hours for me to figure all this out. Per Cisco’s CND for application services, the correct design is to buy four ACEs, a pair in each data center. You would then use GSS (Global Site Selector) to load balance across your load balancers!
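If I had been scanning the logs for that deny message from the start, I might have saved a few of those 7 hours. Here is a rough Python sketch of how you could pull the interesting bits out of a 106023 line. The regex only covers the exact layout of the message quoted above; the function name `parse_deny` and the idea that a leading `#` marks a system-generated ACL are my own assumptions, not anything from Cisco's documentation.

```python
import re

# Pattern for the 106023 deny message in the form quoted in this post.
# Straight and curly quotes are both accepted around the ACL name.
DENY_RE = re.compile(
    r'%ACE-(?P<severity>\d)-106023: Deny (?P<proto>\S+) '
    r'src (?P<src>[^:]+): dst (?P<dst>[^:]+): '
    r'by access-group ["“](?P<acl>[^"”]+)["”]'
)


def parse_deny(line):
    """Return the fields of a 106023 deny message, or None if no match."""
    m = DENY_RE.search(line)
    if m is None:
        return None
    fields = m.groupdict()
    # Assumption (mine): ACL names that start with '#' are hidden system
    # ACLs rather than ones an admin configured.
    fields["system_acl"] = fields["acl"].startswith("#")
    return fields


sample = ('%ACE-4-106023: Deny udp src vlan510: dst undetermined: '
          'by access-group “#FT_VLAN_ACL#4#” [0xffffffff, 0x0]')
print(parse_deny(sample))
```

Run against a saved log, anything that comes back with `system_acl` set and the ft VLAN as the source would be worth a look before rebuilding the ft config yet again.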