The curious case of seemingly random SPF failures in an Exchange Hybrid

Or “How I found out about – Tenant Attribution – and what the heck it is…”

I’ve been working with a client of mine to deploy a fairly decently sized Office 365 migration track, and one of the workloads that we are migrating is (you guessed it!) their Microsoft Exchange environment. And like all projects, it has its ups, its downs, and its roadblocks. I’ve been doing this for a while now, so I’ve seen quite a few of them… But this one I had never encountered before.

The Situation

Our lovely customer decided that they wanted to host their hybrid on existing Exchange servers (which is fine, these servers weren’t actively hosting mailboxes anyway), wanted all of their inbound and outbound mail to route through their on-premises smart hosts, and applied whitelisting to the Office 365 URLs. The mail gateways have certificates for TLS (but do not sit in between EXO and the hybrid servers). So far, so good. No weird requests for now…

All this leaves their environment looking a bit like this:

 

Now pictures might speak a thousand words, but let me sum this up for you:

  1. The Exchange Online Tenant
  2. The Exchange Online Protection Tenant linked to the Exchange Online Tenant. Yes, even if you don’t use EOP, you still get EOP. 
  3. An external mail system that sends email to (or receives email from) the customer’s environment…
  4. The edge/DMZ network mail gateways
  5. The hybrid servers
  6. The main Exchange infrastructure
The Green lines show our Hybrid SMTP connector (notice how it goes through EOP?)
The Red lines show outgoing SMTP traffic (coincidentally, this is also the path for incoming email)
The Blue(ish) line represents mail flow between the main infrastructure and the hybrid servers.
 

Centralized mail transport
Centralized mail transport is this nifty little option hidden away in the Hybrid Configuration Wizard that allows you to specify that “All in- and outbound email should traverse the on-premises mail systems”. This is usually very handy for customers that have some sort of compliance reason to have mail flow that way, and it doesn’t require a lot of special configuration… Now this customer had been really good and set their SPF records properly. All email flow worked perfectly fine until the day we ran the Hybrid Configuration Wizard.

The issue at hand

So just now I said everything was fine until we ran the HCW. Remember? Well, after we ran the HCW, we started getting reports of outbound email being rejected due to SPF failures. This was odd, since nothing had changed in the way that outbound email was sent. We were a bit dumbstruck. Yet it gets better… This was not happening to all outbound email. Oh no… That would have been too easy! It was happening to seemingly random domains!

At that point I did what any one of us in this situation would do. I picked myself up from the floor and asked for logging, specifically the email headers of the failed mails. Then I attempted to recreate the problem and analyze the heck out of it!

Narrowing it down

After some extensive testing (sending mails to Gmail can be considered extensive, no?) we determined that there was a pattern to these SPF failures: they all happened to recipients that were hosted on Exchange Online. The really curious piece is that this was not happening to all recipients hosted in Exchange Online. Only a subset of them.

When analyzing the headers of those emails that hard failed on SPF, we noticed a similarity: they all had the customer’s tenant in the header. Mails that got successfully delivered did not have the customer’s tenant in the header.

Now this is particularly odd! Why the heck is the customer’s tenant being placed in the header when centralized mail transport is enabled?
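
If you want to repeat this kind of header triage yourself, here is a minimal sketch in Python. It simply walks every header of a raw message and reports the ones that mention a given tenant marker. The sample message, the host names, the `X-Example-Attributed-Tenant` header and the `contoso.onmicrosoft.com` marker are all invented for illustration; they are not the customer's actual headers, and the real header that carries the tenant may well be named differently.

```python
from email import message_from_string


def headers_mentioning(raw_message: str, marker: str) -> list[tuple[str, str]]:
    """Return (name, value) pairs for headers whose value contains the marker."""
    msg = message_from_string(raw_message)
    needle = marker.lower()
    hits = []
    for name, value in msg.items():
        # Folded headers span several lines; collapse the whitespace first.
        flat = " ".join(str(value).split())
        if needle in flat.lower():
            hits.append((name, flat))
    return hits


# Made-up sample message (hypothetical hosts and header names).
SAMPLE = """\
Received: from mail.protection.example (10.0.0.1) by
 recipient-host.example; Mon, 1 Jan 2018 10:00:00 +0000
Received: from gateway.contoso.com (203.0.113.10) by
 mail.protection.example; Mon, 1 Jan 2018 09:59:58 +0000
X-Example-Attributed-Tenant: contoso.onmicrosoft.com
From: alice@contoso.com
To: bob@fabrikam.example
Subject: test

body
"""

for name, value in headers_mentioning(SAMPLE, "contoso.onmicrosoft.com"):
    print(f"{name}: {value}")
```

Running the same function over a delivered mail and a rejected one makes the difference between the two stand out quickly.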

Expected behavior

The routing behavior we expected to see (and were seeing on most Exchange Online hosted recipients!) is illustrated below:

 

  1. An email gets sent from our on-premises environment (or comes to our on-premises via the hybrid servers…). Since we’re using mail gateways, Exchange routes this message to the mail gateways living in the DMZ
  2. The mail gateways do their DNS lookup, see that the mail should be routed to EOP, and establish a connection to deliver the mail (TLS secured, since this is enabled on the mail gateways!)
  3. EOP receives the message, goes through its rule list and delivers the message to Exchange Online
  4. Exchange Online delivers the message to the intended recipient…
Random behavior causing SPF hard failures
Now what we were observing on some recipient domains:
  1. An email gets sent from our on-premises environment (or comes to our on-premises via the hybrid servers…). Since we’re using mail gateways, Exchange routes this message to the mail gateways living in the DMZ
  2. The mail gateways do their DNS lookup, see that the mail should be routed to EOP, and establish a connection to deliver the mail (TLS secured, since this is enabled on the mail gateways!)
  3. EOP receives the message, goes through its rule list and, by some black magic, decides this email should be sent to the customer’s EOP tenant?!
  4. The customer’s EOP tenant decides this email is not meant for it (it doesn’t match any of its inbound connectors), so it routes the mail to the appropriate tenant.
Now what does that EOP tenant do? Well, it does an SPF check. And with SPF checks, the last hop the mail was received from has to be authorized to send said email. Since we are using centralized mail routing, we did not change our SPF record, so the only authorized servers are the mail gateways. The last hop the recipient’s tenant sees, however, is now the customer’s EOP tenant rather than our gateways. Of course EOP throws a fit, wags its little finger and says “Ah Ah Ah, you didn’t say the magic word!”, rejecting the email with a hard SPF failure…
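
To make that last step concrete, here is a rough sketch of the core of an SPF check in Python: compare the IP that handed over the mail with the networks the sender's SPF record authorizes. This is a deliberately simplified illustration. It only understands `ip4:`/`ip6:` mechanisms and the final `all` term, while real SPF (RFC 7208) also evaluates `include`, `a`, `mx` and friends through DNS lookups. The record and the addresses are documentation-range placeholders, not the customer's real values.

```python
import ipaddress


def spf_result(record: str, connecting_ip: str) -> str:
    """Evaluate a tiny subset of SPF: ip4/ip6 mechanisms and the final 'all'."""
    ip = ipaddress.ip_address(connecting_ip)
    terms = record.split()
    if not terms or terms[0].lower() != "v=spf1":
        return "none"
    for term in terms[1:]:
        lowered = term.lower()
        if lowered.startswith(("ip4:", "ip6:")):
            network = ipaddress.ip_network(term[4:], strict=False)
            if ip.version == network.version and ip in network:
                return "pass"
        elif lowered.lstrip("+-~?") == "all":
            qualifier = lowered[0] if lowered[0] in "-~?" else "+"
            return {"-": "fail", "~": "softfail", "?": "neutral", "+": "pass"}[qualifier]
        # Any other mechanism (include, a, mx, ...) is skipped in this sketch.
    return "neutral"


# Placeholder record: only the (hypothetical) gateway range is authorized.
RECORD = "v=spf1 ip4:203.0.113.0/28 -all"

print(spf_result(RECORD, "203.0.113.10"))   # mail handed over by a gateway
print(spf_result(RECORD, "198.51.100.25"))  # mail handed over by some other hop
```

With a record like this, a message arriving straight from the gateways passes, while the same message relayed by any address outside that range hits the `-all` and hard fails. That is the situation the detour through the customer's EOP tenant creates.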
 

What the heck?!

Still no explanation as to why this is randomly happening to some tenants in Exchange Online, but not all of them. At least we’re tightening the noose, right?

Now the next bit of information needed cajoling, begging, and pleading (basically, we asked Microsoft to tell us what was happening…), and we found out 4 things:
  1. EOP tenants are hosted in different “forests”
  2. The first (or one of the first) rules to be processed is whether or not this is an email to on-premises (determined by the name on the TLS certificate)
  3. These rules exist on the forest level, not the EOP tenant level
  4. What we are experiencing is an “Undocumented Feature” called Tenant Attribution
Ah! Now we’re getting somewhere! Let’s recap…
  1. We’re using centralized mail transport
  2. We did not update our SPF record to include spf.protection.outlook.com (because we’re not using EOP to send emails…)
  3. Our mail gateways have a certificate with *.contoso.com as the subject name.
  4. Our inbound connector from EXO to on-premises is listed as *.contoso.com 
  5. EOP always sits before Exchange Online, even for the on-premises hybrid connector
  6. EOP does not apply fancy magic to mails destined to on-premises
  7. EOP determines a mail is for on-premises by the TLS cert subject name?
  8. EOP does not process further rules if the on-premises rule is matched
  9. Rules live on the forest level in EOP
  10. There are many forests…
Taking all of the above, shaking our brains wildly, we can come to the following conclusions:
Expected behavior: The recipient hosted in Office 365 does not have its EOP tenant in the same forest as the customer’s EOP tenant.
 
Failure behavior: The recipient hosted in Office 365 has its EOP tenant in the same forest as the customer’s EOP tenant, thus the connector rule for *.contoso.com is matched (since it’s a wildcard), and the mail gets routed to the customer’s EOP tenant to deliver on-premises. Since this mail is not destined for on-premises, the customer’s EOP tenant sends it to the recipient’s EOP tenant, stamping the headers with its IP as a hop, causing SPF to fail!
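
Here is a toy model of that conclusion in Python. It is based only on what we were told and what we observed, and it is definitely not Microsoft's actual implementation. The forest names, the tenant name and the gateway certificate name `gw.contoso.com` are made up for the example, and the matching is a plain wildcard comparison, which is an assumption on my part.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Optional


@dataclass(frozen=True)
class InboundConnector:
    tenant: str        # tenant that owns the connector
    forest: str        # forest that tenant lives in (hypothetical names)
    cert_pattern: str  # certificate name on the connector, e.g. "*.contoso.com"


def attributed_tenant(
    cert_name: str,
    receiving_forest: str,
    connectors: list[InboundConnector],
) -> Optional[str]:
    """Return the tenant a message gets attributed to, or None if no rule matches.

    Only connectors in the forest that receives the message are considered,
    mirroring the statement that these rules live at the forest level.
    """
    for connector in connectors:
        same_forest = connector.forest == receiving_forest
        cert_matches = fnmatchcase(cert_name.lower(), connector.cert_pattern.lower())
        if same_forest and cert_matches:
            return connector.tenant
    return None


CONNECTORS = [InboundConnector("contoso", "forest-A", "*.contoso.com")]

# Recipient's tenant shares the forest: the mail is attributed to contoso first.
print(attributed_tenant("gw.contoso.com", "forest-A", CONNECTORS))
# Recipient's tenant is in another forest: no attribution, normal delivery.
print(attributed_tenant("gw.contoso.com", "forest-B", CONNECTORS))
```

The first call models the failure path, the second the expected one. Which forest a recipient's tenant lives in is outside the sender's control, which would explain why the failures looked random from our side.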
 
Victory! Ah, no… Not quite…
 

What are my options?

So when you’re faced with this scenario, you have a few options:
  1. Do nothing: Maybe you don’t care about the recipients receiving those emails. Maybe you’re planning to move to EOP in the near future and you just can’t be bothered… Personally, I couldn’t do it, but kudos to you if you could just sit back and let these poor messages be slaughtered!
  2. Disable the inbound connector: Disabling it causes the rule to be no longer applied, and mail will flow correctly again. Unfortunately this will also break mail flow between Exchange Online and on-premises…
  3. Add include:spf.protection.outlook.com to your SPF record: Possibly the easiest solution you have. Adding this does not mean that someone in Office 365 would be able to spoof your domain. There are checks and balances built into the service to prevent anyone who does not have the domain added to their tenant from sending email as that domain. So it’s a low impact solution… But some security teams will balk at this, and why wouldn’t they? After all, this is a fix that should be unnecessary from a logical point of view.
  4. Add hybrid.contoso.com as a subdomain to O365 and rerun the hybrid configuration wizard: Theoretically, this should resolve the issue as the inbound connector would no longer get stamped with *.contoso.com, but with hybrid.contoso.com. That way, the EOP rule would not trigger since the mail gateways do not have a certificate with hybrid.contoso.com. (If they do, the problem will not go away…). Unfortunately I have never tested this, so I don’t know the long term repercussions of this method. 
  5. Wait for Microsoft to fix this: I don’t think they will, considering there’s a low impact solution. But you can always try holding your breath :)!
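
For option 3, this is roughly what the change to the record looks like, sketched as a small Python helper that slots the include in front of the closing `all` term. The starting record is a placeholder with a documentation IP range, not the customer's real record, and the helper is only an illustration of the edit; it does not validate the record or publish anything to DNS.

```python
def add_include(record: str, domain: str) -> str:
    """Insert 'include:<domain>' before the final 'all' term of an SPF record."""
    terms = record.split()
    mechanism = f"include:{domain}"
    if mechanism.lower() in (term.lower() for term in terms):
        return record  # already present, leave the record untouched
    for index, term in enumerate(terms):
        if term.lower().lstrip("+-~?") == "all":
            terms.insert(index, mechanism)
            break
    else:
        # No 'all' term found: append the include at the end.
        terms.append(mechanism)
    return " ".join(terms)


# Placeholder record that only authorizes the (hypothetical) gateway range.
BEFORE = "v=spf1 ip4:203.0.113.0/28 -all"
AFTER = add_include(BEFORE, "spf.protection.outlook.com")
print(AFTER)
```

Keep in mind that every `include` counts toward SPF's limit of 10 DNS lookups per evaluation, so it is worth checking how many lookups your record already triggers before adding another one.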

Conclusion

In the end we figured out what caused our problem, discovered an undocumented feature, and had some major fun doing so! At least I did… I live for these weird fringe cases that let me activate my little grey cells…

I’m very likely going to advise all future customers to add the “include” option to their SPF record, and now I can explain why this is needed! 
