Getting a grip on archiving mail threads
I’m subscribed to both the Ruby Talk and Ruby on Rails mailing lists. Both are high volume. I typically don’t have enough time to read all that is going on, but I do like to have the emails around so I can search for a specific topic.
I like to keep my high volume mailing lists’ threads archived by month. This
means that the topic thread head’s Date header determines where the entire
thread is archived, even if a thread child’s Date header falls in a different
month. For low volume lists, this can be done by hand using any mail client.
For high volume lists, doing it by hand is tedious and prone to mistakes.
Computers are for this type of task. It is time to work hard at being
lazy…
Here’s what I did to tackle this problem. My time was limited: I only had a
couple of hours to create something to do the above for my two high volume
lists. I had two Maildirs containing the Ruby Talk (~/.maildir/.ruby.talk)
and Ruby on Rails (~/.maildir/.ruby.rails) mailing lists. Each contained
more than 50,000 emails stored in individual files in the lists’ /cur
directory.
So my ~/.maildir is organized like the following:
    .ruby.rails/
    .ruby.rails.200603/
    .ruby.rails.200604/
    .ruby.rails.200605/
    .ruby.rails.200606/
    .ruby.rails.200607/
    .ruby.rails.200608/
    .ruby.rails.200609/
    .ruby.rails.200610/
    .ruby.rails.200611/
    .ruby.talk/
    .ruby.talk.200603/
    .ruby.talk.200604/
    .ruby.talk.200605/
    .ruby.talk.200606/
    .ruby.talk.200607/
    .ruby.talk.200608/
    .ruby.talk.200609/
    .ruby.talk.200610/
    .ruby.talk.200611/
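The naming scheme in that listing boils down to "source Maildir, a dot, then the year and month of the thread head". Here is a minimal sketch of that rule; the helper name `archive_dir_for` and the example path are made up for illustration and are not part of the script.

```ruby
require 'date'

# Hypothetical helper: builds the archive Maildir path for a whole thread
# from the Date header of the thread's head message.
def archive_dir_for(src_mail_dir, head_date)
  "#{File.expand_path(src_mail_dir)}.#{head_date.strftime('%Y%m')}"
end

# A thread that starts at the end of March and gets a reply in April.
head_date  = Date.new(2006, 3, 30)
reply_date = Date.new(2006, 4, 2)

# The reply's own date is ignored; only the head's date picks the archive.
target = archive_dir_for('/home/me/.maildir/.ruby.talk', head_date)
puts "head  (#{head_date}) -> #{target}"
puts "reply (#{reply_date}) -> #{target}"
```

Both messages end up in the `.ruby.talk.200603` archive, which is the behavior I want for threads that straddle a month boundary.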
The requirements were:
- Archive threads into the appropriate archive directories (should correctly archive 99.9% of the time).
- Keep track of thread heads and their associated archive location so subsequent runs catch thread children dated after the previous run.
- Shouldn’t consume excessive amounts of memory.
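The second requirement is the interesting one: the mapping from thread head to archive location has to survive between runs. A rough sketch of that idea is a tab-separated text file that is written at the end of a run and read back at the start of the next. The method names and the two-column format below are simplifications assumed for illustration; the real script stores a few more fields.

```ruby
require 'tmpdir'

# Hypothetical index format: one line per thread head,
# "Message-ID <TAB> archive suffix (YYYYMM)".
def save_index(path, heads)
  File.open(path, 'w') do |file|
    heads.each { |id, suffix| file << "#{id}\t#{suffix}\n" }
  end
end

# Reads the index back; a missing file simply means "first run".
def load_index(path)
  heads = {}
  return heads unless File.file?(path)
  File.foreach(path) do |line|
    id, suffix = line.chomp.split("\t", 2)
    heads[id] = suffix
  end
  heads
end

Dir.mktmpdir do |dir|
  index = File.join(dir, 'mail_sort.idx')
  save_index(index, { '<a@example.org>' => '200603' })
  # A later run can now route late replies to the thread's original archive.
  puts load_index(index)['<a@example.org>']
end
```

Reading the file line by line keeps memory use proportional to the number of thread heads, not the number of emails, which also helps with the third requirement.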
Since I intended to be the sole user of this program and the scope of functionality was so narrow, I decided to write a self-contained script to flesh out the logic and behavior. This meant that testing by hand was OK for me (if this were developed for someone else, I would not choose this path). In future development iterations, I will break out the functionality into classes and modules along with real test specs.
The next decision was how to process email headers. Since TMail is being maintained again, I decided to use it instead of parsing the email headers myself.
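The script leaves header parsing to TMail, but the decision it makes with those headers is simple enough to show on its own. The following is a stdlib-only sketch, not TMail and not the script's code: `parse_headers` and `thread_head_id` are hypothetical helpers, and the hand-rolled parsing ignores most of the corner cases a real mail library handles.

```ruby
# Assumption: the header block arrives as one raw string. This toy parser
# stands in for TMail purely for illustration.
def parse_headers(raw)
  unfolded = raw.gsub(/\r?\n[ \t]+/, ' ')   # join folded continuation lines
  headers = {}
  unfolded.each_line do |line|
    break if line.strip.empty?              # a blank line ends the headers
    name, value = line.chomp.split(':', 2)
    headers[name.downcase] = value.strip if value
  end
  headers
end

# Returns the Message-ID this message treats as its thread head: the first
# entry in References, else In-Reply-To. Returns nil when neither header is
# present, which means the message is itself a thread head.
def thread_head_id(headers)
  refs = headers.fetch('references', '').scan(/<[^>]+>/)
  return refs.first unless refs.empty?
  headers.fetch('in-reply-to', '').scan(/<[^>]+>/).first
end

head  = "Message-ID: <a@example.org>\nSubject: hi\n\nbody"
reply = "Message-ID: <c@example.org>\n" \
        "References: <a@example.org>\n <b@example.org>\n" \
        "Subject: Re: hi\n\nbody"

p thread_head_id(parse_headers(head))    # nil, so this is a thread head
p thread_head_id(parse_headers(reply))   # "<a@example.org>"
```

The first Message-ID in References is normally the message that started the thread, which is why it is tried before In-Reply-To.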
The following is the heavily commented script that I created. The most current source can be found at http://svn.drotner.org/repos/unix_tools/trunk/bin/mail_sort.rb
    #!/usr/bin/env ruby
    # Author: Kelly McCauley
    # Copyright 2007 Kelly McCauley
    # Source: http://svn.drotner.org/repos/unix_tools/trunk/bin/mail_sort.rb
    # License: version 0.1 is Public Domain

    require 'rubygems'
    require 'optparse'    # Parses commandline options
    require 'tmail'       # Handles the email parsing
    require 'date'        # Ruby's date library
    require 'fileutils'   # File and directory manipulation library

    $VERBOSE = true

    @version = '0.1'
    @debug = 0
    @quiet = false
    @days_ago = 30              # Default "Sort and archive mail up to @days_ago".
    @src_mail_dir = nil         # Maildir to sort and archive.
    @thread_heads = {}          # Maps a thread head's Message-ID to
                                # its associated archive directory.
    @thread_head_index = nil    # Location of a saved version of @thread_heads
                                # from a previous run.
    @total_orphans = 0          # Count of thread children that have no parents.
    @total_emails = 0           # Total emails read.
    @total_emails_archived = 0  # Total emails that were moved to an archive
                                # directory.

    #
    # Methods
    #

    # Prints out the given msgs and opts to STDERR and then exits
    def error_exit(opts, *msgs)
      msgs.each {|m| $stderr.print m}
      $stderr.puts opts
      exit(1)
    end

    # Loads a saved @thread_heads from a previous run into memory.
    def load_thread_head_index(index_file)
      if File.file?(index_file)
        File.open(index_file) do |file|
          file.each_line do |line|
            key, year, sum, mon = line.chomp.split(/\t/)
            @thread_heads[key.to_sym] = [year.to_sym, sum.to_i, mon.to_sym]
          end
        end
      end
    end

    # Dumps @thread_heads that are less than 365 days ago to a file.
    #
    # I didn't serialize it to YAML because I didn't want the extra processing
    # overhead or memory consumption. I didn't Marshal it since I didn't want
    # the saved file to be tied to the particular version of Marshal.
    def dump_thread_head_index(index_file)
      File.open(index_file, 'w') do |file|
        @thread_heads.each do |key,value|
          next if value[1] < @th_index_cutoff_sum
          file << "#{key.to_s}\t#{value.map{|x| x.to_s}.join("\t")}\n"
          @th_index_dump_count += 1
        end
      end
    end

    # Adds the given email to the @thread_heads lookup table.
    def add_thread_head(email)
      unless @thread_heads.key?(email['message-id'].id.to_sym)
        $stderr.puts "th subject: '#{email['subject'].to_s}'" if @debug > 2
        @thread_heads[email['message-id'].id.to_sym] = [
          email['date'].date.year.to_s.to_sym,
          (
            email['date'].date.year.to_s +
            sprintf('%02d', email['date'].date.mon) +
            sprintf('%02d', email['date'].date.day)
          ).to_i,
          sprintf('%02d', email['date'].date.mon).to_sym,
        ]
      end
    end

    # Creates the archive maildir
    def create_archive_maildir(root_archive_dir)
      sub_dirs = []
      sub_dirs << File.join(root_archive_dir, 'cur')
      sub_dirs << File.join(root_archive_dir, 'new')
      sub_dirs << File.join(root_archive_dir, 'tmp')

      options = {}
      options[:noop] = true if @debug > 2
      options[:verbose] = true if @debug > 1

      sub_dirs.each do |dir|
        unless File.directory?(dir)
          FileUtils.mkdir_p(dir, options)
        end
      end

      return sub_dirs
    end

    # Archives the given file to the given archive directory
    def archive_email(root_archive_dir, filename)
      archive_dir = create_archive_maildir(root_archive_dir).shift

      options = {}
      options[:noop] = true if @debug > 2
      options[:verbose] = true if @debug > 1

      if @debug > 0
        FileUtils.cp(filename, archive_dir, options)
      else
        FileUtils.mv(filename, archive_dir)
      end

      @total_emails_archived += 1
    end

    # Archives the thread child email into the appropriate maildir
    def archive_thread_child(thread_head, src_mail_dir, filename)
      $stderr.puts "tc #{filename}: #{@thread_heads[thread_head][1]} <= #{@cutoff_sum}" if @debug > 2
      if (@thread_heads[thread_head][1] <= @cutoff_sum)
        $stderr.puts "tc filename: #{filename}" if @debug > 2
        root_archive_dir = "#{File.expand_path(src_mail_dir)}.#{@thread_heads[thread_head].first.to_s}#{@thread_heads[thread_head].last.to_s}"
        archive_email(root_archive_dir, filename)
      end
    end

    # Archives the thread head email into the appropriate maildir
    def archive_thread_head(email, src_mail_dir, filename)
      # Determine this email's date sum.
      email_sum = (
        email['date'].date.year.to_s +
        sprintf('%02d', email['date'].date.mon) +
        sprintf('%02d', email['date'].date.day)
      ).to_i

      $stderr.puts "th #{filename}: #{email_sum} <= #{@cutoff_sum}" if @debug > 2

      # Is the email before the cutoff date?
      if email_sum <= @cutoff_sum
        # Yes.
        $stderr.puts "th filename: #{filename}" if @debug > 2
        root_archive_dir = "#{File.expand_path(src_mail_dir)}.#{email['date'].date.year}#{sprintf('%02d', email['date'].date.mon)}"
        # Archive it.
        archive_email(root_archive_dir, filename)
      end
    end

    #
    # Handle the commandline arguments
    #
    opts = OptionParser.new do |opts|
      opts.banner = "Usage: #{$0} [OPTIONS] MAILDIR"
      opts.separator("")
      opts.separator("OPTIONS")

      opts.on(
        '-D','--days-ago NUMBER',
        'Sort and archive mail up to --days-ago'
      ) do |days|
        # OptionParser hands over a String; the Date arithmetic below needs
        # an Integer.
        @days_ago = days.to_i
      end

      opts.on(
        '-i','--thread-head-index FILE',
        'Specify the thread head index file'
      ) do |file|
        # Must match the variable read later on (@thread_head_index).
        @thread_head_index = file
      end

      opts.on_tail(
        '-q','--quiet',
        'Turns off all output including error output'
      ) do |q|
        @quiet = true
      end

      opts.on_tail(
        '-d','--debug',
        'Turns on debugging output'
      ) do |debug|
        @debug += 1
      end

      # help
      opts.on_tail(
        '-h', '--help',
        'Shows this message'
      ) do ||
        error_exit(opts)
      end

      # version
      opts.on_tail(
        '-V', '--version',
        'Shows the version and copyright of mail_sort'
      ) do ||
        error_exit(opts, "#{$0} version #{@version}\n")
      end
    end
    opts.parse!(ARGV)

    # Make sure that the source Maildir is given and that the directory exists.
    @src_mail_dir = ARGV.shift
    error_exit(
      opts,
      "ERROR: failed to specify a MAILDIR\n"
    ) unless @src_mail_dir
    error_exit(
      opts,
      "ERROR: MAILDIR does not exist: #{@src_mail_dir}\n"
    ) unless File.directory?(@src_mail_dir)

    #
    # Determine the cut-off dates. Used in simple numerical comparison of dates.
    #

    # The cut-off date for determining if thread heads are targeted for archival.
    @cutoff = Date.today - @days_ago
    @cutoff_sum = (
      @cutoff.year.to_s +
      sprintf('%02d', @cutoff.mon) +
      sprintf('%02d', @cutoff.day)
    ).to_i

    # The cut-off date for storing thread heads in @thread_heads.
    thi = Date.today - 365
    @th_index_cutoff_sum = (
      thi.year.to_s +
      sprintf('%02d', thi.mon) +
      sprintf('%02d', thi.day)
    ).to_i
    @th_index_dump_count = 0

    # Compose the location of the thread head index file
    if @thread_head_index.nil?
      @thread_head_index = "#{File.expand_path(@src_mail_dir)}.mail_sort.idx"
    end

    # Pre-run debugging
    if @debug > 0
      $stderr.puts "@debug: '#{@debug}'"
      $stderr.puts "@src_mail_dir: '#{@src_mail_dir}'"
      $stderr.puts "@thread_head_index: '#{@thread_head_index}'"
      $stderr.puts "@days_ago: '#{@days_ago}'"
      $stderr.puts "@cutoff: '#{@cutoff.to_s}'"
      $stderr.puts "@cutoff_sum: '#{@cutoff_sum.to_s}'"
      $stderr.puts "@th_index_cutoff_sum: '#{@th_index_cutoff_sum}'"
    end

    #
    # Do the run.
    #

    # Load the thread head index if it exists.
    load_thread_head_index(@thread_head_index)

    # The location of the Maildir's cur directory.
    src_mail_dir_cur = File.join(File.expand_path(@src_mail_dir),'cur')

    # Iterate through each file in the Maildir's cur directory.
    Dir.foreach(src_mail_dir_cur) do |filename|
      # Skip . and ..
      next if filename == '.'
      next if filename == '..'

      filename = File.join(src_mail_dir_cur, filename)

      # Skip any directories.
      next unless File.file?(filename)

      $stderr.puts "filename: #{filename}" if @debug > 2

      # Parse the file into an email.
      email = TMail::Mail.parse(IO.read(filename))

      if email['references'].nil? && email['in-reply-to'].nil?
        # This email is a thread head
        if email['message-id'].nil? || email['message-id'].id.nil?
          # This email is a malformed email (the header itself may be missing).
          $stderr.puts "No message-id for #{filename}" unless @quiet
        else
          # Add this email as a thread head.
          add_thread_head(email)
          # Archive this email.
          archive_thread_head(email, @src_mail_dir, filename)
        end
      else
        # This email is a thread child
        thread_head = nil

        # Determine the thread's head (Simple case first since it is the most
        # common)
        if !email['references'].nil? && !email['references'].ids.empty?
          # This email has a References header and it is not empty
          thread_head = email['references'].ids.first.to_sym
        elsif !email['in-reply-to'].nil? && !email['in-reply-to'].empty?
          # This email only has an In-Reply-To header which is not empty
          thread_head = email['in-reply-to'].to_s.to_sym
        end

        # Lookup the thread head in @thread_heads.
        if @thread_heads.key?(thread_head)
          # Found it, so archive this email in the thread head's archive
          # directory.
          archive_thread_child(thread_head, @src_mail_dir, filename)
        else
          # Possibly an orphaned child. See if any of its other references are
          # known thread heads.
          thread_head = nil
          if email['references'].nil? && !email['in-reply-to'].empty?
            # No References header so use the In-Reply-To header.
            ref = email['in-reply-to'].to_s.to_sym
            thread_head = ref if @thread_heads.key?(ref)
          elsif !email['references'].nil? && !email['references'].empty?
            # Use References header. Iterate through each of the references and
            # use the first that matches as the thread's head.
            email['references'].ids.each do |ref|
              ref = ref.to_s.to_sym
              if @thread_heads.key?(ref)
                thread_head = ref
                break
              end
            end
          end

          # Do we now have the thread's head?
          if thread_head
            # Yes, so archive this email in the thread head's archive
            # directory.
            archive_thread_child(thread_head, @src_mail_dir, filename)
          else
            # No. We have an orphan.
            $stderr.puts "th orphan" if @debug > 2
            @total_orphans += 1
            # Archive it as a thread head.
            add_thread_head(email)
            archive_thread_head(email, @src_mail_dir, filename)
          end
        end
      end

      @total_emails += 1
    end

    # The run is done, so save @thread_heads.
    dump_thread_head_index(@thread_head_index)

    # Post-run debugging.
    if @debug > 0
      $stderr.puts "@thread_heads.length: #{@thread_heads.length}"
      $stderr.puts "@total_orphans: #{@total_orphans}"
      $stderr.puts "@total_emails: #{@total_emails}"
      $stderr.puts "@total_emails_archived: #{@total_emails_archived}"
      $stderr.puts "@th_index_dump_count: #{@th_index_dump_count}"
    end
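Invoking it is simple: ./mail_sort.rb -h prints the usage and the available options, and a real run just needs the Maildir to sort, for example ./mail_sort.rb ~/.maildir/.ruby.talk.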
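One detail of the script worth spelling out is the "date sum": each date is flattened into a YYYYMMDD integer so that the cutoff check becomes a plain numeric comparison. Below is a compact sketch of the same idea using `Date#strftime` in place of the script's `sprintf` calls; `date_sum` and `archivable?` are names made up for this example.

```ruby
require 'date'

# Flattens a date into a YYYYMMDD integer, e.g. 2006-03-05 becomes 20060305.
# Zero padding keeps the integers in the same order as the dates.
def date_sum(date)
  date.strftime('%Y%m%d').to_i
end

# A thread head qualifies for archiving once it is at least days_ago old.
def archivable?(head_date, today, days_ago)
  date_sum(head_date) <= date_sum(today - days_ago)
end

today = Date.new(2006, 11, 15)                        # cutoff is 2006-10-16
puts date_sum(Date.new(2006, 3, 5))                   # 20060305
puts archivable?(Date.new(2006, 10, 1), today, 30)    # true
puts archivable?(Date.new(2006, 11, 1), today, 30)    # false
```

Storing the integer instead of a Date object also keeps the saved index file trivial to write and parse.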
