Interview with Jason Hunter of MarkMail
Jason Hunter is a Principal Technologist at Mark Logic and heads development on MarkMail, a new way of searching email by first choosing a broad topic and then drilling down into the results to refine the search as you go. The Perl Review interviewed Jason in May 2008, after MarkMail had loaded 530,000 messages from 75 Perl mailing lists.
The Perl Review: For MarkLogic, you use Java to convert email to an XML format. Are you using Perl internally for part of the process? Was Java your first choice? It's certainly a fine language in the hands of skilled developers, but usually not what I expect someone to put at the top of the list for text munging.
Jason Hunter: As you know, inside the MarkMail system we hold every email as an XML document. This works well for us because we use the MarkLogic Server to store and query the messages. MarkLogic uses XML as its native datatype and has lots of indexes designed for XML content. The XML schema we use lets us represent things like quote levels, footers, signature blocks, attachment contents, and message headers in a very convenient, natural, and efficient way.
We use Java to do the conversion between the raw RFC 822 mail format and our XML format, for a few reasons. Most importantly, we're using a Java-based mail server to catch the mails and a Java library to push the mails into MarkLogic Server. Using Java in between lets us handle the conversion in-process. Why did we use Java for the bookend technology? We decided to use Apache JAMES for our mail server because it provided us with some necessary flexibility, and to use MarkLogic's Java-based XCC access API because it's one of the officially supported libraries (the Perl access library is developed by the community and less mature).
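As a rough illustration of the kind of conversion described here, the following Python sketch turns a raw RFC 822 message into a small XML tree. It is not MarkMail's converter, which is written in Java and has not been published; the element names (message, headers, header, body, line) and the quote-level attribute are hypothetical choices made only for this example.

```python
from email import message_from_string
from email.policy import default
import xml.etree.ElementTree as ET


def message_to_xml(raw: str) -> ET.Element:
    """Convert a raw RFC 822 message into a small, hypothetical XML layout."""
    msg = message_from_string(raw, policy=default)
    root = ET.Element("message")

    # Each header becomes its own element so it can be queried individually.
    headers = ET.SubElement(root, "headers")
    for name, value in msg.items():
        header = ET.SubElement(headers, "header", name=name)
        header.text = str(value)

    # Pick the plain-text part; this sketch ignores attachments entirely.
    part = msg.get_body(preferencelist=("plain",))
    text = part.get_content() if part is not None else ""

    # Tag every body line with its quote depth (number of leading '>' marks).
    body = ET.SubElement(root, "body")
    for line in text.splitlines():
        stripped = line.lstrip()
        depth = 0
        while stripped.startswith(">"):
            depth += 1
            stripped = stripped[1:].lstrip()
        element = ET.SubElement(body, "line", {"quote-level": str(depth)})
        element.text = stripped
    return root
```

Calling ET.tostring(message_to_xml(raw), encoding="unicode") gives the serialized document. Recording the quote depth as an attribute is one simple way to let a later query skip quoted text.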
TPR: I hadn't heard of XCC before. That's the MarkMail XML Content Converter, an in-house project, right? All Google hits seem to lead back to you.
Jason: Correct, XCC is the library that lets you connect to a MarkLogic Server instance from your programming environment.
TPR: Poking around, I found http://xqzone.marklogic.com/svn/libmlxcc/trunk/, and it looks like there are at least the starts of APIs for Perl, Python, Ruby, and PHP. Are you looking for community input or submissions? What can those communities do to help?
Jason: Mark Logic provides and supports XCC libraries for Java and .NET. On the Mark Logic Developer Site (also aliased as http://xqzone.marklogic.com) we host various open source projects related to the server. Among those are XCC connectors for other languages.
TPR: The Perl project, written by Andrew Bruno under the Apache License, is a SWIG wrapper around mlxcc, also written by Andrew. Is that a community project, or something from within MarkMail? Is there an officially supported library the open source communities could use to make language bindings?
Jason: What I particularly like about the mlxcc project that Andy Bruno created is that it's a code-once/reuse-many approach because it has the core functionality in C code and uses SWIG to support languages like Perl, Python, Ruby, and PHP. Andy wrote the library when he worked at O'Reilly Media, so they could run content transformation pipelines driven by Ruby and run analytics driven by Perl.
TPR: Were you able to use existing Java libraries for the basic XML handling and creation, or did you need to make your own?
Jason: Ha, both! This question gave me a chuckle because we used JDOM to help with the XML handling, and JDOM is both a pre-existing open source library and also something I built myself (acting as the JDOM project lead).
Of course, JDOM has nothing to do with email; it's just an object model making it easier for Java to do XML work. When we started MarkMail we looked to see whether anyone had written a robust email-to-XML converter and found no one had. I say "robust" because handling the mess that is the real world provided the biggest challenge in authoring the converter. We haven't open sourced what we wrote because it turned out to be very targeted to our needs.
TPR: What was the hardest challenge in converting a message to XML? What did you think would be hard but actually wasn't?
Jason: Mailers lie. They tell big whoppers. They tell you the MIME boundary will be one thing when in fact it's another. Or they tell you the charset is something that just doesn't exist. They send malformed headers, illegal characters, and bogus dates. The standard email parsing libraries don't handle real world emails as well as we wanted, so we've had to extend them somewhat.
I thought handling attachments would be harder than it turned out to be. We managed to create a reliable pipeline to extract the text and structure from PDF, PowerPoint, Word, and Excel files, as well as others. So now on the site you can search attachment contents and (better still) view the attachments inline with your browser, not needing an external file viewer. If I do say so myself, it's pretty slick, especially because by understanding the attachment structure we can give a red underline to any page having a search term hit. If only one slide in a deck has a hit, we'll show you which one.
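The nonexistent charsets and bogus dates mentioned in this answer can be handled defensively. The Python sketch below shows one possible approach, not the extensions MarkMail actually made to its Java libraries: try the declared charset first, fall back to common ones, and treat an unparseable Date header as missing. The fallback order and the choice to read zone-less dates as UTC are assumptions of this sketch.

```python
import codecs
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def safe_decode(data: bytes, declared_charset: Optional[str]) -> str:
    """Decode bytes with the declared charset, falling back when it lies."""
    for name in (declared_charset, "utf-8", "latin-1"):
        if not name:
            continue
        try:
            codecs.lookup(name)  # raises LookupError for unknown charsets
            return data.decode(name)
        except (LookupError, UnicodeDecodeError):
            continue
    # latin-1 accepts every byte, so this line is only a safety net.
    return data.decode("latin-1", errors="replace")


def safe_date(header_value: Optional[str]) -> Optional[datetime]:
    """Parse a Date header, returning None instead of raising on bogus input."""
    if not header_value:
        return None
    try:
        parsed = parsedate_to_datetime(header_value)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return None
    if parsed is None:
        return None
    # Assumption of this sketch: a date without a zone is treated as UTC.
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed
```

Returning None for a bad date keeps the decision about what to do next (skip the message, guess from Received headers, flag it for review) with the caller.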
TPR: One of my messages is at http://markmail.org/message/47vhrf3mphalquqj. That looks like a unique identifier from MarkMail, rather than the message ID from the original sender. This leads me to a few more questions. Have you found that the original message IDs are unreliable, or even not unique? For instance, people can put whatever they like in that header. How much of a problem is that?
Jason: Your guess is right: Message-ID headers are useful but not completely reliable. Many mails in the world, especially the dusty mails we gather from hodge-podge historic archives, don't have Message-ID headers. Some other lucky mails have two!
TPR: Is there a way to go from the original email message to the MarkMail ID, which could then lead to the messages around the original? Right away I'm thinking about a tool that can take any message in my inbox and put me in the middle of the thread in MarkMail.
Jason: Something like that would be cool, wouldn't it? It's technically possible.
TPR: Do you keep the original message ID around? Maybe I can't compute the MarkMail ID, but maybe I can look it up based on the original message ID. Sometimes I might get no results, or multiple results, but I think most of the time I'd get the result I was looking for if the message is in MarkMail.
Jason: We don't throw away anything. That's one of the perks of modeling email messages as XML. Each header is just another queryable XML element. Do you think people want a "lookup by Message-ID" feature? We could definitely add it.
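A "lookup by Message-ID" feature of the sort discussed here could be sketched as a simple index. The Python below is only an illustration: MarkMail stores messages as XML in MarkLogic Server, whereas this example uses an in-memory dictionary, and the internal IDs are hypothetical stand-ins for MarkMail's own identifiers. It allows for the cases raised in the interview: messages with no Message-ID, messages with two, and IDs shared by more than one message.

```python
from collections import defaultdict
from email import message_from_string


def build_message_id_index(archive):
    """Map each Message-ID header value to the internal IDs that carry it.

    `archive` is a dict of {internal_id: raw_message_text}.
    """
    index = defaultdict(list)
    for internal_id, raw in archive.items():
        msg = message_from_string(raw)
        # get_all returns every occurrence, or the default if the header is absent.
        for value in msg.get_all("Message-ID", []):
            index[value.strip()].append(internal_id)
    return index


def lookup(index, message_id):
    """Return all internal IDs for a Message-ID (possibly none or several)."""
    return index.get(message_id, [])
```

Because lookup returns a list, a caller has to decide what to do with zero or multiple matches, which mirrors the "no results, or multiple results" situation the question anticipates.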
TPR: In March, MarkMail imported 75 Perl lists with over 500,000 messages, some going back to 1995. How does that compare to other imports? The Postgres lists seemed to have several hundred thousand messages too.
Jason: Perl is huge, that's for sure. We've loaded 500,000 emails and we're looking to add even more, thanks to some people from the Perl community who have their own message histories earlier than the ones we originally loaded.
Perl's not quite the biggest though. If we look at message traffic for different communities, and if we focus on human-authored email chatter (that is, if we exclude check-in emails and bug change notifications), we have a few behemoths:
TPR: What's the error rate on an import that size? That is, are there any that you have to look at manually and adjust the importer? Do you need to import in more than one phase or do clean-up steps before the messages reach XML? How dirty of a job is this really?
Jason: Very good question! It's definitely a messier job than most people realize. We test load on a staging server and examine the results in MarkMail. Issues tend to jump out. For example, the chart makes it easy to spot gaps or spikes in the history. We've learned that gaps typically indicate a problem gathering the history, while a spike most often indicates a spam outbreak.
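The gap-and-spike check described here is done by eye on the traffic chart, but the same idea can be sketched in code. The Python below is a hypothetical illustration, not part of MarkMail: it flags months with no messages as gaps and months far above the median as spikes. The threshold of five times the median is an arbitrary assumption.

```python
from statistics import median


def find_anomalies(monthly_counts, spike_factor=5.0):
    """Flag months with zero traffic (gaps) or far above the median (spikes).

    `monthly_counts` is a list of (month_label, message_count) pairs.
    """
    counts = [count for _, count in monthly_counts]
    typical = median(counts) if counts else 0
    gaps = [month for month, count in monthly_counts if count == 0]
    spikes = [
        month
        for month, count in monthly_counts
        if typical > 0 and count > spike_factor * typical
    ]
    return gaps, spikes
```

A flagged gap would suggest re-checking how the history was gathered, and a flagged spike would suggest looking for spam, following the rule of thumb given in the answer.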
TPR: The natural comparison for MarkMail is Google Groups, which provides the same sort of service for usenet messages. Given that usenet messages look very much like email (and some lists exist as both mailing lists and newsfeeds), can you apply MarkLogic to usenet?
Jason: Email and newsgroups are both time-oriented, author-attributed communication that's distributed to a group focused on a particular topic. The MarkMail features focusing on time, authorship, and topics would make sense in both domains.
TPR: What is the difference between indexing usenet and mailing lists? Is the problem a lot thornier than it looks? Does one have more connecting or referential information than the other?
Jason: Personally, it's been a long time since I posted on or read Usenet. I tend to spend my time on more tightly-focused community-oriented mailing lists, which for me are things like the JDOM development list and the MarkLogic community list. These don't have newsgroup counterparts.
I believe each public email needs a home on the web, a permalink that's easy to find and share. That's something we're trying to do with MarkMail.
TPR: There are hundreds of Perl-related lists. What can a mailing list representative do to make it easy for you to import his lists? Do you just need a pointer, or can they send an archive? If they can send an archive, what's the best way to structure it? From the time that you import a new list, how long does it take to show up in MarkLogic?
Jason: To load a list we do two things:
First, we subscribe a robot user to the list. This robot receives mail like any other subscriber but redirects each incoming mail into the MarkMail archive. Within seconds of the robot user having received a new mail, it's available on the site and will appear in search results.
Second, we work hard to obtain and load the back history for each list. Sometimes this work feels like "email archaeology", trying to scrounge up good records from many years ago.
List admins can help a lot with both these tasks. For subscribing, we support a web service interface that list admins can use to keep us informed as new lists are created. We do this a lot with large communities where new lists frequently appear, such as CodeHaus.
For back histories, list admins usually have the inside track on getting the best archives. For example, Mailman often exposes somewhat-mangled Pipermail archives publicly but keeps pristine mbox files under /usr/local/mailman/archives for those with permissions to view them. Sometimes having special access doesn't matter. For example, the Perl list archives were readily available as newsgroup items over NNTP.
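For list admins who want to inspect an mbox archive before handing it over, Python's standard mailbox module can read the format directly. The sketch below is a minimal illustration, unrelated to MarkMail's own loader: it walks an mbox file and reports each message's Message-ID and Subject. The function name is hypothetical.

```python
import mailbox


def iter_mbox_headers(path):
    """Yield (Message-ID, Subject) for each message in an mbox file."""
    box = mailbox.mbox(path, create=False)
    try:
        for message in box:
            # Either header may be missing in old archives, so None is possible.
            yield message["Message-ID"], message["Subject"]
    finally:
        box.close()
```

Passing create=False makes a mistyped path fail loudly instead of silently creating an empty mailbox.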
People can send us requests to load new lists using our feedback form. If the requestor is the list admin or can hook us up with the list admin, we can usually proceed more quickly.
TPR: Are you interested in "dead" or "historical" lists that are no longer active or have been shut down? If someone has an archive of those, how can they load them into MarkMail?
Jason: If an archive has lots of information that's still relevant today, we'll definitely load it. If it's of historic interest only, we'll load it as well but we won't prioritize it.
TPR: Will there be public APIs for accessing MarkMail through third-party programs?
Jason: Yes, absolutely. If people have ideas for what they'd like to see here, or can tell us about concrete ways in which they'd immediately make use of such APIs, we'd really like to hear it (http://markmail.org/docs/feedback.xqy).
TPR: Are people already making third-party applications based on MarkMail, such as Facebook applications?
Jason: What we see people doing most of the time is embedding the MarkMail traffic histogram into their pages, showing off their list, their community, or their own posts. For example, the perl.org site embedded a chart and search box, and so have others, like NetcoolUsers. We're working to make this easier.
TPR: How about an X Prize style competition for the best third party application using MarkMail? Companies I use, such as Netflix and Whitepages.com, have been doing that and it seems to stimulate the open source communities to contribute free work.
Jason: Are you asking for MarkLogic, or MarkMail? I'll answer both!
For MarkLogic Server, there's a Community Edition that's free for use, even in production, capped at 100 Megs of stored content. That's a fair amount of content for an individual (it'd let me run MarkMail against my own Inbox). Some people who've gotten started with that version and had a compelling idea for a non-commercial application have asked for and received licenses.
For MarkMail, first we need to get our web service APIs in place and open up the various components. Then we can consider something like this.
TPR: Ohloh.net has a nice feature that allows people to claim an identity. In MarkMail, I see messages from myself show up under several different email addresses over the years, and with variations on my name. Will there be a way that someone can say "This is me", and treat those different email addresses as a single person?
Jason: Yes. The question we're wondering is, will people mind if others can see those associations?
TPR: Part of your model is to help search engines such as Google, Yahoo!, and MSN find the content MarkMail has indexed so they can index it themselves. How do you do that?
Jason: It's a serious challenge encouraging Google, Yahoo, and MSN to crawl and index our many millions of web pages. I think it's fair to say there are very few sites in the world with more than 10 million truly unique and useful pages, so it's not something for which the search engines optimize. It's easy to trip the spam detecting heuristics. We've had to work hard to continue being seen as the good guys.
Exposing the mails to crawlers isn't that challenging. We advertise our mails using sitemap files, referenced in our robots.txt and pulled by all the major crawlers to learn about our site. We also have a Browse link in our footer designed for crawlers to traverse and, via a hierarchical crawl, hit every message. We're getting some good traction now. The Google crawler, for example, pulls pages at a rate of 8 per second every second of every day.
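The sitemap mechanism mentioned here follows the public sitemaps.org protocol, in which a robots.txt file points crawlers at an XML list of URLs. The Python sketch below shows the general shape of both files. It is not MarkMail's generator, and the example.org URLs used with it are placeholders.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML document listing the given page URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


def robots_txt(sitemap_url):
    """Return a robots.txt that allows crawling and advertises the sitemap."""
    return f"User-agent: *\nAllow: /\nSitemap: {sitemap_url}\n"
```

For example, build_sitemap(["http://example.org/message/abc"]) produces a one-entry sitemap. For scale, the quoted crawl rate of 8 pages per second works out to 8 × 86,400 = 691,200 pages per day.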
Much of our referral traffic comes from Google, so we like to make their job easier in finding the content we offer. We try to make it so that people who find us via Google remember us and come back to us directly, because we offer a very email-focused feature set:
- Ability to limit by date, sender, list, attachment name, etc.
- An understanding that list:james means to search the Apache JAMES server lists while from:james means a person's name. With Google you have a very hard time searching for things related to Apache JAMES because "james" is so common a word.
- Immediate inclusion of new mails in the index. No need to wait to see what people are talking about right now.
- No duplicate emails. MarkMail has one copy of all the mails, and just one.
- An understanding of email structure so you can exclude things like quoted text. This lets you find when a particular person said a particular thing.
- An integrated search and viewing experience, with keyboard shortcuts like "n" and "p" to move to the next and previous message in the results.
- A histogram traffic chart displayed for every search result.
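Two of the features in this list, fielded queries such as list:james and from:james, and the exclusion of quoted text, can be sketched briefly. The Python below is a hypothetical illustration only; MarkMail's real query parser and its handling of quote levels are not described in this interview. It assumes whitespace-separated tokens and treats any line starting with ">" as quoted.

```python
def parse_query(query, fields=("list", "from")):
    """Split a query into fielded constraints and free-text terms."""
    constraints = {}
    terms = []
    for token in query.split():
        prefix, sep, value = token.partition(":")
        if sep and value and prefix.lower() in fields:
            constraints.setdefault(prefix.lower(), []).append(value)
        else:
            # Unknown prefixes (for example a URL) stay as ordinary terms.
            terms.append(token)
    return constraints, terms


def unquoted_lines(body):
    """Return only the lines the author wrote, dropping '>'-quoted text."""
    return [
        line for line in body.splitlines()
        if not line.lstrip().startswith(">")
    ]
```

With this split, parse_query("list:james from:james smtp") keeps the two meanings of "james" apart: one restricts the mailing list, the other the sender, and "smtp" remains a plain search term.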