Problem Statement
Electronic mail at MIT
E-mail is vital to our daily lives.
I get ~500 e-mail messages each day.
The system isn’t very reliable.
It is frequently slow (minutes+).
It is sometimes catastrophically slow (hours+, days).
We can improve these with transparency.
End-to-end Mail System Monitoring at MIT
A proposed set of tools for instrumentation and monitoring of the MIT campus electronic mail system.
John Hawkinson
Massachusetts Institute of Technology
End-to-end monitoring
Have you seen this dialogue before? Alice is a user and Bob is her local networking expert.
Alice: “I can’t reach nytimes.com”
Bob: “Let’s try to ping it. Nope, I can’t reach it, either. OK, let’s try some traceroutes. . .”
. . . 5 minutes pass. . .
Bob: “It looks like it was working fine, nytimes.com just doesn’t answer pings.”
My inbox
Some statistics from my mailbox this morning
3391 messages in my inbox date from May 2008
11 messages with delivery times over 1 day
93 between 1 hour and 1 day
197 between 10 minutes and 1 hour
1471 between 1 minute and 10 minutes
1678 under 1 minute and still positive
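A breakdown like the one above can be produced by sorting per-message delivery delays into fixed buckets. The following is only a sketch: the bucket labels mirror the slide, while the function name, the input (a list of delays in seconds), and the choice to set aside negative delays as clock skew are assumptions, not a description of how the numbers above were computed.

```python
from bisect import bisect_right

# Upper bounds of each bucket, in seconds: 1 minute, 10 minutes, 1 hour, 1 day.
BOUNDS = [60, 600, 3600, 86400]
LABELS = [
    "under 1 minute",
    "1 minute to 10 minutes",
    "10 minutes to 1 hour",
    "1 hour to 1 day",
    "over 1 day",
]


def bucket_delays(delays):
    """Count delivery delays (in seconds) per bucket.

    Assumption: negative delays come from clock skew and are skipped.
    A delay that sits exactly on a boundary goes into the slower bucket.
    """
    counts = dict.fromkeys(LABELS, 0)
    for delay in delays:
        if delay < 0:
            continue
        counts[LABELS[bisect_right(BOUNDS, delay)]] += 1
    return counts


if __name__ == "__main__":
    for label, count in bucket_delays([5, 59, 60, 599, 3600, 90000, -3]).items():
        print(f"{count:5d} {label}")
```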
Benefits of this work
Globally
Build a suite of useful tools
Locally
Encourage transparency of mail system statistics
Encourage software upgrades
By highlighting flaws in the system (sunshine), increase the pressures for change.
Previous work
Existing monitoring software. . .
. . . isn’t very good.
Generally part of a general monitoring software framework
Tools like Nagios, mon, etc.
Connects to an SMTP port and looks for a canned response
Monitors processes on the mail server(s)
Does not monitor end-to-end
MIT-specific: Existing weekly reports of average delay per server (meaningless).
Proposed Work
SMTP to IMAP end-to-end probes
Run actual test mail messages through the mail system.
Inject them at the well-defined entry points
Internal mail: outgoing.mit.edu, outgoing-legacy.mit.edu
External mail: w91-130-barracuda-1.mit.edu, etc.
Intermediate points: fort-point-station.mit.edu, etc.
Measure their delay
End-to-end delay
Hop-by-hop delay
Details of probes
So what do we actually send in these test e-mail messages, anyhow?
Contents
Transmission time of probe
Cryptographic hash of contents to prevent spoofing
Variable-size payload to simulate realistic load issues
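One way such a probe could be put together is sketched below. It is an illustration, not the proposed implementation: the header name X-Probe-Digest, the shared SECRET, and both function names are hypothetical, and a keyed hash (HMAC-SHA256) is swapped in for the plain "cryptographic hash" because an unkeyed hash alone would not stop someone from forging a probe.

```python
import hashlib
import hmac
import os
import time
from email.message import EmailMessage

SECRET = b"replace-with-a-real-secret"  # hypothetical key known only to the prober


def build_probe(sender, recipient, payload_bytes, now=None):
    """Build a probe carrying its send time, random padding, and a keyed digest."""
    sent_at = int(time.time() if now is None else now)
    # Hex encoding doubles the length, so ask for half as many random bytes.
    padding = os.urandom(payload_bytes // 2).hex()
    # Keep body lines short so the message stays plain 7-bit text.
    lines = [padding[i:i + 64] for i in range(0, len(padding), 64)]
    body = "\n".join([f"sent-at: {sent_at}"] + lines) + "\n"
    digest = hmac.new(SECRET, body.encode("ascii"), hashlib.sha256).hexdigest()

    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"mail probe {sent_at}"
    msg["X-Probe-Digest"] = digest  # hypothetical header name
    msg.set_content(body)
    return msg


def verify_probe(msg):
    """Return the probe's send time, or None if the digest does not match."""
    body = msg.get_content()
    expected = hmac.new(SECRET, body.encode("ascii"), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, msg.get("X-Probe-Digest", "")):
        return None
    return int(body.splitlines()[0].split(": ", 1)[1])
```

Varying payload_bytes from probe to probe is what simulates different message sizes. A real deployment would also need to decide how to cope with servers that rewrite message bodies in transit, which this sketch ignores.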
Received: headers
How do we measure the transit time of a probe?
Each SMTP server adds a Received header when it touches an e-mail message
Received: from SENIOR-THREE-FORTY-THREE.MIT.EDU
    (SENIOR-THREE-FORTY-THREE.MIT.EDU [18.244.6.88])
    (User authenticated) as [email protected])
    by webmail.mit.edu (Horde MIME library) with HTTP;
    Mon, 12 May 2008 13:26:23 -0400
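The timestamp a server records is the date that follows the last semicolon in its Received header, so pulling it out is mostly a matter of splitting and date parsing. The sketch below shows one way to do that in Python; it is not rcvanal.pl, and the function name and the simple from/by pattern are assumptions that will not cover every header format seen in the wild.

```python
import re
from email.utils import parsedate_to_datetime

# Hypothetical pattern: the first "from" host and the first "by" host.
HOP_RE = re.compile(r"\bfrom\s+(\S+).*?\bby\s+(\S+)", re.DOTALL)


def parse_received(value):
    """Return (from_host, by_host, unix_time) for one Received: header value.

    Raises an exception if the header has no parseable date after a semicolon.
    """
    # The timestamp is whatever follows the last semicolon.
    _, _, date_part = value.rpartition(";")
    stamp = int(parsedate_to_datetime(date_part.strip()).timestamp())
    match = HOP_RE.search(value)
    from_host, by_host = match.groups() if match else (None, None)
    return from_host, by_host, stamp
```

Converting every hop to Unix time puts servers in different time zones on one scale, which is what makes hop-by-hop subtraction possible.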
Received: header analysis
I wrote a tool... rcvanal.pl
454 lines of Perl
Written from Nov 2003 through Dec 2007
Up to revision 1.56
Sample output
from rcvanal.pl this afternoon
Message-Id: <[email protected]>
#  Raw time    mmm:ss  fromhost         byhost            with
D  1209999418  ---:--  --               Date header
6  1209999448    0:30  CCR-240.MIT.EDU  outgoing          ESMTP
5  1209999449    0:01  outgoing         bIs(dmza)         ESMTP
4  1209999463    0:14  bIs(dmza)        pch               ESMTP
3  1209999467    0:04  pch              pch               ESMTP
2  1209999700    3:53  pch              cccs(mailhub)     ESMTP
1  1210000017    5:17  cccs(mailhub)    po12(8.13.6/4.7)
0  1210000021    0:04  po12(unix)       po12(Cyrus)       LMTP
Elapsed time 10:03 (0:10:03).
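The mmm:ss column is just the difference between consecutive raw timestamps, and the elapsed time is the difference between the last and the first. A small Python sketch of that arithmetic, using the raw times from the sample above (the function names are illustrative and are not taken from rcvanal.pl):

```python
def format_delay(seconds):
    """Format a delay in seconds as m:ss, keeping a sign for clock skew."""
    sign = "-" if seconds < 0 else ""
    minutes, secs = divmod(abs(seconds), 60)
    return f"{sign}{minutes}:{secs:02d}"


def hop_delays(timestamps):
    """Given Unix times ordered oldest first, return the delay of each hop."""
    return [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]


# Raw times from the sample: the Date header first, then hops 6 down to 0.
RAW = [
    1209999418, 1209999448, 1209999449, 1209999463,
    1209999467, 1209999700, 1210000017, 1210000021,
]

if __name__ == "__main__":
    for delay in hop_delays(RAW):
        print(format_delay(delay))
    print("Elapsed time", format_delay(RAW[-1] - RAW[0]))
```

In this sample two hops (3:53 and 5:17) account for 550 of the 603 seconds, which is the kind of detail an end-to-end number alone would hide.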
Scope of work
Timeline to actually produce this
Build framework for generating probes with database: 1 week
Build framework for harvesting probes: 2 weeks
Build reporting infrastructure summarizing data: 1 week
Collect data for a long time: 1 month
Why me?
Why am I the right person for this?
Good background in running mail systems
I already do this kind of monitoring and advocacy, only by hand
Maybe if I have a machine do it, it won’t be so personal
Guesstimate: 60% of the time that the mail system has delays of 4+ hours, I am the first person to report it.
This is not a good way to make friends.
I’ve written articles for The Tech on the MIT mail system, so I have a forum to publish locally.
In closing. . .
There are no good stats on our mail system, and its performance is not good.
If we build a system to reliably monitor it, we can quantify how bad it is.