I got a panic-stricken phone call from a customer at 18:00 on Wednesday evening.
They had over 10,000 inbound messages stuck on the SMTP bridgehead server that were not being delivered to mailboxes. Outbound SMTP was working fine. Inbound wasn’t.
We worked out that the last inbound message had been delivered at about 11:00 on Wednesday, so assuming the default two-day expiration timeout, by 11:00 on Friday these messages would start to be NDR’d.
We had a bit of time, but not a lot.
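For the record, the deadline arithmetic is simple enough to sketch. This assumes the SMTP virtual server's expiration timeout was still at its default of two days, and the calendar date below is made up (any Wednesday would do); the function names are mine.

```python
from datetime import datetime, timedelta

# Assumption: the expiration timeout is still the default of 2 days.
DEFAULT_EXPIRY = timedelta(days=2)


def ndr_deadline(oldest_queued, expiry=DEFAULT_EXPIRY):
    """Return the moment the oldest queued message expires and is NDR'd."""
    return oldest_queued + expiry


def hours_left(now, oldest_queued, expiry=DEFAULT_EXPIRY):
    """Return the hours remaining before the first NDR goes out."""
    return (ndr_deadline(oldest_queued, expiry) - now).total_seconds() / 3600


# Hypothetical dates: 1 March 2006 was a Wednesday.
last_delivery = datetime(2006, 3, 1, 11, 0)   # last good inbound delivery
phone_call = datetime(2006, 3, 1, 18, 0)      # when the customer rang

print(ndr_deadline(last_delivery))            # Friday 11:00
print(hours_left(phone_call, last_delivery))  # hours left at the time of the call
```

If the timeout had been changed from the default, passing a different `expiry` gives the real deadline.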
Oh, and I was 200 miles away, on another site. It was going to be one of those nights.
So, first, what’s changed? Well, nothing, apparently. I don’t believe that things ever just stop working, but trying to find anything that has changed is a fruitless exercise sometimes.
The SMTP bridgehead is in a DMZ, so we open up the firewall, and permit any any, but no luck.
We telnet to port 25 on the internal mailbox server, and sure enough we get a 220 ready.
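If telnet isn't to hand, the same banner check can be scripted. This is only a sketch of what we did by hand: it opens a TCP connection to port 25 and reads the greeting, nothing more. The host name in the usage note is a placeholder, and the function names are mine.

```python
import socket


def smtp_banner(host, port=25, timeout=10.0):
    """Connect to an SMTP listener and return the first greeting it sends."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        return conn.recv(1024).decode("ascii", errors="replace").strip()


def is_ready(banner):
    """True if the greeting starts with reply code 220 (service ready)."""
    return banner.startswith("220")
```

Usage would be something like `is_ready(smtp_banner("mailbox01.example.local"))`, run from the bridgehead so that it crosses the same firewall path as the stuck mail.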
We stop and restart the Default SMTP virtual server, and it takes a while, but it starts with absolutely no problem.
We increase diagnostic logging, and try again, and sure enough, no problem, no errors.
Time to get a little curious. SMTP seems perfectly happy.
We check the c:\program files\exchsrvr\mailroot\vsi 1\ directory structure. Badmail is empty, Queue is full.
This is good, it means if we can kick start SMTP the mail is not yet lost.
I ask them to check which Queue the messages are stuck in. It’s Messages awaiting directory lookup….
This is the Active Directory, not DNS, but just to be safe we ping the internal server using its IP, and name, and FQDN, and yep, no problem.
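The name-resolution half of that check can be sketched in a few lines. Note that this only shows whether each form of the name resolves to an address; it does not send an ICMP echo, so it is not a full substitute for ping. The names in the usage note are placeholders.

```python
import socket


def resolve_all(names):
    """Map each name (IP, short name or FQDN) to its IPv4 address, or None."""
    results = {}
    for name in names:
        try:
            results[name] = socket.gethostbyname(name)
        except OSError:
            # Resolution failed; record it rather than stopping at the first error.
            results[name] = None
    return results
```

For example, `resolve_all(["10.0.0.5", "mailbox01", "mailbox01.example.local"])` should return the same address three times if DNS is healthy.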
This is one of the servers I built years ago, so I know the Support Tools are installed.
We run LDP, and we connect and bind to AD. I can even run eventvwr and open the event logs on the internal server, so we know RPC is working fine.
In a fit of hunger-fueled desperation, I suggested we cross a bridge we would have to cross at some point anyway, and reapply SP2.
This gives me time to eat 🙂 I ordered a pint of Stella, and a chicken, bacon, brie and cranberry burger.
I consulted the oracle. I googled for messages awaiting directory lookup.
I came up with these very useful links…
Messages awaiting directory lookup. This queue contains messages with recipient addresses that have not been resolved against the Active Directory. Messages are also held in this queue while distribution lists are expanded.
Article ID: 251746 How to troubleshoot messages that remain in the “Messages awaiting directory lookup” queue in Exchange Server 2003 and in Exchange 2000 Server
Article ID: 884996 Messages remain in the “Messages awaiting directory lookup” SMTP queue in Exchange Server 2003 or in Exchange 2000
Article ID: 308350 Problematic message may be continually retried and may hold up other messages in connection queue
I also found this extremely detailed blog entry
I sent all of this to my customer. When I rang them back, after going through all of the troubleshooting steps, we were still no closer. As seems always to be the way, we had found twenty references to a cause for this problem, but we had the twenty-first!
In the meantime my customer had run the ExTRA against the server. This had told him that there was a corrupt log file, and he was trying to recover the IS off tape.
Now this was brilliant news.
I asked him to open Active Directory Users and Computers, DSA.MSC
Right click the domain object, and choose Find…
With the default of Users, Contacts, and Groups, switch to the Advanced tab
Drop down Field to User and select Exchange Home Server
Change the Condition: filter to Ends with and in the Value: enter the name of the SMTP bridgehead.
In a situation where the IS isn’t up, and you can’t simply expand the Mailbox database, and check the Mailboxes container, this is the only way to determine how many mailboxes are actually on the server.
There were none listed.
There are actually three mailboxes: SystemMailbox, System Attendant and SMTP. These are all system mailboxes, so they don’t show up in the AD LDAP search above, because the mailboxes don’t belong to AD users.
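The same search can be written as a raw LDAP filter and pasted into LDP, which we already had open. The sketch below only builds the filter string. It assumes the "Exchange Home Server" field in the Find dialog maps to the `msExchHomeServerName` attribute, and it is an approximation of the query, not necessarily the exact filter the dialog generates. The server name in the usage note is a placeholder.

```python
def escape_ldap(value):
    """Escape the characters that are special in an LDAP search filter (RFC 4515)."""
    # Backslash goes first so the escapes added below are not escaped again.
    replacements = [
        ("\\", r"\5c"),
        ("*", r"\2a"),
        ("(", r"\28"),
        (")", r"\29"),
        ("\x00", r"\00"),
    ]
    for raw, escaped in replacements:
        value = value.replace(raw, escaped)
    return value


def home_server_filter(server_name):
    """Build an 'ends with' filter on the (assumed) msExchHomeServerName attribute."""
    return "(msExchHomeServerName=*" + escape_ldap(server_name) + ")"
```

`home_server_filter("BRIDGE01")` gives `(msExchHomeServerName=*BRIDGE01)`. An empty result set is the same answer we got from the dialog: no user mailboxes are homed on that server.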
Now, the SMTP mailbox is used for generating NDRs.
If you run SMTP, and you want to generate NDRs, you need a mailbox database to be mounted, which is one of the reasons I don’t like to run SMTP on an FE server. I like to keep the FE server purely for HTTP/S traffic, but I digress…
This was brilliant news.
I explained about messaging dial tone recovery.
Net Stop MSExchangeIS
Rename all of the MDBDATA directories to MDBOLD, this way we still have the original database and logs if we need them.
Create a new MDBDATA directory anywhere we have renamed the old one.
Net Start MSExchangeIS
Right click Mailbox store, and Mount Store
Ignore the error and press OK.
Presto, 10,000 messages started to clear instantly.
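The rename-and-recreate part of those steps can be sketched as a small script. It is a sketch under assumptions, not a recovery tool: it covers only the directory shuffle, it must run between the `Net Stop MSExchangeIS` and `Net Start MSExchangeIS` steps, and the function name and the path in the usage note are mine. It refuses to run if an MDBOLD directory already exists, so an earlier safety copy is never overwritten.

```python
from pathlib import Path


def swap_in_empty_mdbdata(parents, live="MDBDATA", backup="MDBOLD"):
    """Rename each live data directory aside and create an empty replacement.

    Run only while the Information Store service is stopped.
    Returns the list of backup directories that now hold the old files.
    """
    moved = []
    for parent in map(Path, parents):
        old, kept = parent / live, parent / backup
        if not old.is_dir():
            raise FileNotFoundError(f"{old} does not exist")
        if kept.exists():
            raise FileExistsError(f"{kept} already exists; refusing to overwrite it")
        old.rename(kept)   # keep the original database and logs
        old.mkdir()        # empty directory for the dial tone database
        moved.append(kept)
    return moved
```

Called as, say, `swap_in_empty_mdbdata([r"D:\Exchsrvr"])` with one entry for every location that holds an MDBDATA directory. The old files stay in MDBOLD in case they are needed later.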
Now what’s the lesson we learn here?
We would not have had this dial tone option, if the SMTP mailbox was on a mailbox store database with live users…
The quick-win option was only available here because the SMTP mailbox was in its own storage group. There was nothing else that shared the transaction logs, so we had the option to sacrifice them and any content in the database.
How much does a dedicated SMTP bridgehead server cost?
How much would even a storage group dedicated to SMTP cost?
Compare that to the price of 10,000 lost e-mails…
My burger was cold…
How much did that cost me? £8.75…