Getting a grip on archiving mail threads
I’m subscribed to both the Ruby Talk and Ruby on Rails mailing lists. Both are high volume. I typically don’t have enough time to read all that is going on, but I do like to have the emails around so I can search for a specific topic.
I like to keep the threads of my high-volume mailing lists archived by month.
This means that the Date header of a thread’s head determines where the entire
thread is archived, even if the Date headers of the thread’s children fall in
a different month. For low-volume lists, this can be done by hand using any
mail client. For high-volume lists, doing it by hand is tedious and prone to
mistakes. Computers are made for this type of task. It is time to work hard at
being lazy…
Here’s what I did to tackle this problem. My time was limited: I only had a
couple of hours to create something that does the above for my two high-volume
lists. I had two Maildirs containing the Ruby Talk (~/.maildir/.ruby.talk)
and Ruby on Rails (~/.maildir/.ruby.rails) mailing lists. Each contained
more than 50,000 emails stored as individual files in the list’s cur
directory.
So my ~/.maildir is organized like the following:
    .ruby.rails/
    .ruby.rails.200603/
    .ruby.rails.200604/
    .ruby.rails.200605/
    .ruby.rails.200606/
    .ruby.rails.200607/
    .ruby.rails.200608/
    .ruby.rails.200609/
    .ruby.rails.200610/
    .ruby.rails.200611/
    .ruby.talk/
    .ruby.talk.200603/
    .ruby.talk.200604/
    .ruby.talk.200605/
    .ruby.talk.200606/
    .ruby.talk.200607/
    .ruby.talk.200608/
    .ruby.talk.200609/
    .ruby.talk.200610/
    .ruby.talk.200611/
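The naming scheme of these archive directories follows directly from the date of each thread's head. Here is a minimal sketch of that mapping, assuming a hypothetical helper named archive_dir_for that is not part of the script shown later:

```ruby
require 'date'

# Hypothetical helper: build the monthly archive Maildir name for a thread
# whose head carries the Date header head_date.
def archive_dir_for(src_mail_dir, head_date)
  format('%s.%04d%02d', src_mail_dir, head_date.year, head_date.mon)
end

# A thread started on the last day of October. Replies sent in November are
# still filed with the head, so only the head's date is consulted.
head_date = Date.new(2006, 10, 31)

puts archive_dir_for('.ruby.talk', head_date)   # prints ".ruby.talk.200610"
```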
The requirements were:
- Archive threads into the appropriate archive directories (should correctly archive 99.9% of the time).
- Keep track of thread heads and their associated archive location so subsequent runs catch thread children dated after the previous run.
- Shouldn’t consume excessive amounts of memory.
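The second requirement boils down to a lookup table that maps a thread head to the archive it was filed under. The following is only a rough sketch of that idea: it uses plain strings instead of parsed emails, and the Message-IDs and the archive_for helper are made up for illustration. The real script stores more detail per thread head.

```ruby
# Maps a thread head's Message-ID to the archive month it was filed under.
thread_heads = {
  '<head-1@example.org>' => '200610'
}

# Hypothetical lookup: the first entry of a reply's References header is
# normally the thread head. Returns nil when the head is unknown (an orphan).
def archive_for(thread_heads, references)
  thread_heads[references.first]
end

# A reply sent in November still resolves to its head's October archive.
reply_refs = ['<head-1@example.org>', '<reply-1@example.org>']

puts archive_for(thread_heads, reply_refs).inspect                  # "200610"
puts archive_for(thread_heads, ['<unknown@example.org>']).inspect   # nil
```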
Since I intended to be the sole user of this program and the scope of functionality was so narrow, I decided to write a self-contained script to flesh out the logic and behavior. This meant that testing by hand was OK for me (if this were developed for someone else, I would not choose this path). In future development iterations, I will break out the functionality into classes and modules along with real test specs.
The next decision was how to process email headers. Since TMail is being maintained again, I decided to use it instead of parsing the email headers myself.
The following is the heavily commented script that I created. The most current source can be found at http://svn.drotner.org/repos/unix_tools/trunk/bin/mail_sort.rb
    #!/usr/bin/env ruby

    # Author: Kelly McCauley
    # Copyright 2007 Kelly McCauley
    # Source: http://svn.drotner.org/repos/unix_tools/trunk/bin/mail_sort.rb
    # License: version 0.1 is Public Domain

    require 'rubygems'
    require 'optparse'   # Parses commandline options
    require 'tmail'      # Handles the email parsing
    require 'date'       # Ruby's date library
    require 'fileutils'  # File and directory manipulation library

    $VERBOSE = true

    @version = '0.1'
    @debug = 0
    @quiet = false
    @days_ago = 30             # Default "Sort and archive mail up to @days_ago".
    @src_mail_dir = nil        # Maildir to sort and archive.
    @thread_heads = {}         # Maps a thread head's Message-ID to
                               # its associated archive directory.
    @thread_head_index = nil   # Location of a saved version of @thread_heads
                               # from a previous run.
    @total_orphans = 0         # Count of thread children that have no parents.
    @total_emails = 0          # Total emails read.
    @total_emails_archived = 0 # Total emails that were moved to an archive
                               # directory.

    #
    # Methods
    #

    # Prints out the given msgs and opts to STDERR and then exits
    def error_exit(opts, *msgs)
      msgs.each {|m| $stderr.print m}
      $stderr.puts opts
      exit(1)
    end

    # Loads a saved @thread_heads from a previous run into memory.
    def load_thread_head_index(index_file)
      if File.file?(index_file)
        File.open(index_file) do |file|
          file.each_line do |line|
            key, year, sum, mon = line.chomp.split(/\t/)
            @thread_heads[key.to_sym] = [year.to_sym, sum.to_i, mon.to_sym]
          end
        end
      end
    end

    # Dumps @thread_heads that are less than 365 days ago to a file.
    #
    # I didn't serialize it to YAML because I didn't want the extra processing
    # overhead or memory consumption. I didn't Marshal it since I didn't want
    # the saved file to be tied to the particular version of Marshal.
    def dump_thread_head_index(index_file)
      File.open(index_file, 'w') do |file|
        @thread_heads.each do |key,value|
          next if value[1] < @th_index_cutoff_sum
          file << "#{key.to_s}\t#{value.map{|x| x.to_s}.join("\t")}\n"
          @th_index_dump_count += 1
        end
      end
    end

    # Adds the given email to the @thread_heads lookup table.
    # Each entry is [year, yyyymmdd date sum, zero-padded month].
    def add_thread_head(email)
      unless @thread_heads.key?(email['message-id'].id.to_sym)
        $stderr.puts "th subject: '#{email['subject'].to_s}'" if @debug > 2
        @thread_heads[email['message-id'].id.to_sym] = [
          email['date'].date.year.to_s.to_sym,
          (
            email['date'].date.year.to_s +
            sprintf('%02d', email['date'].date.mon) +
            sprintf('%02d', email['date'].date.day)
          ).to_i,
          sprintf('%02d', email['date'].date.mon).to_sym,
        ]
      end
    end

    # Creates the archive maildir
    def create_archive_maildir(root_archive_dir)
      sub_dirs = []
      sub_dirs << File.join(root_archive_dir, 'cur')
      sub_dirs << File.join(root_archive_dir, 'new')
      sub_dirs << File.join(root_archive_dir, 'tmp')

      options = {}
      options[:noop] = true if @debug > 2
      options[:verbose] = true if @debug > 1

      sub_dirs.each do |dir|
        unless File.directory?(dir)
          FileUtils.mkdir_p(dir, options)
        end
      end

      return sub_dirs
    end

    # Archives the given file to the given archive directory
    def archive_email(root_archive_dir, filename)
      # The first sub-directory returned is the archive's cur directory.
      archive_dir = create_archive_maildir(root_archive_dir).shift

      options = {}
      options[:noop] = true if @debug > 2
      options[:verbose] = true if @debug > 1

      # When debugging, copy instead of move so the source Maildir is untouched.
      if @debug > 0
        FileUtils.cp(filename, archive_dir, options)
      else
        FileUtils.mv(filename, archive_dir)
      end

      @total_emails_archived += 1
    end

    # Archives the thread child email into the appropriate maildir
    def archive_thread_child(thread_head, src_mail_dir, filename)
      $stderr.puts "tc #{filename}: #{@thread_heads[thread_head][1]} <= #{@cutoff_sum}" if @debug > 2
      if (@thread_heads[thread_head][1] <= @cutoff_sum)
        $stderr.puts "tc filename: #{filename}" if @debug > 2
        root_archive_dir = "#{File.expand_path(src_mail_dir)}.#{@thread_heads[thread_head].first.to_s}#{@thread_heads[thread_head].last.to_s}"
        archive_email(root_archive_dir, filename)
      end
    end

    # Archives the thread head email into the appropriate maildir
    def archive_thread_head(email, src_mail_dir, filename)
      # Determine this email's date sum.
      email_sum = (
        email['date'].date.year.to_s +
        sprintf('%02d', email['date'].date.mon) +
        sprintf('%02d', email['date'].date.day)
      ).to_i

      $stderr.puts "th #{filename}: #{email_sum} <= #{@cutoff_sum}" if @debug > 2

      # Is the email before the cutoff date?
      if email_sum <= @cutoff_sum
        # Yes.
        $stderr.puts "th filename: #{filename}" if @debug > 2
        root_archive_dir = "#{File.expand_path(src_mail_dir)}.#{email['date'].date.year}#{sprintf('%02d', email['date'].date.mon)}"

        # Archive it.
        archive_email(root_archive_dir, filename)
      end
    end

    #
    # Handle the commandline arguments
    #
    opts = OptionParser.new do |opts|
      opts.banner = "Usage: #{$0} [OPTIONS] MAILDIR"
      opts.separator("")
      opts.separator("OPTIONS")

      opts.on(
        '-D','--days-ago NUMBER',
        'Sort and archive mail up to --days-ago'
      ) do |days|
        # The option arrives as a String; Date arithmetic needs an Integer.
        @days_ago = days.to_i
      end

      opts.on(
        '-i','--thread-head-index FILE',
        'Specify the thread head index file'
      ) do |file|
        @thread_head_index = file
      end

      opts.on_tail(
        '-q','--quiet',
        'Turns off all output including error output'
      ) do |q|
        @quiet = true
      end

      opts.on_tail(
        '-d','--debug',
        'Turns on debugging output'
      ) do |debug|
        @debug += 1
      end

      # help
      opts.on_tail(
        '-h', '--help',
        'Shows this message'
      ) do ||
        error_exit(opts)
      end

      # version
      opts.on_tail(
        '-V', '--version',
        'Shows the version and copyright of mail_sort'
      ) do ||
        error_exit(opts, "#{$0} version #{@version}\n")
      end
    end
    opts.parse!(ARGV)

    # Make sure that the source Maildir is given and that the directory exists.
    @src_mail_dir = ARGV.shift
    error_exit(
      opts, "ERROR: failed to specify a MAILDIR\n"
    ) unless @src_mail_dir
    error_exit(
      opts, "ERROR: MAILDIR does not exist: #{@src_mail_dir}\n"
    ) unless File.directory?(@src_mail_dir)

    #
    # Determine the cut-off dates. Used in simple numerical comparison of dates.
    #

    # The cut-off date for determining if thread heads are targeted for archival.
    @cutoff = Date.today - @days_ago
    @cutoff_sum = (
      @cutoff.year.to_s +
      sprintf('%02d', @cutoff.mon) +
      sprintf('%02d', @cutoff.day)
    ).to_i

    # The cut-off date for storing thread heads in @thread_heads.
    thi = Date.today - 365
    @th_index_cutoff_sum = (
      thi.year.to_s +
      sprintf('%02d', thi.mon) +
      sprintf('%02d', thi.day)
    ).to_i
    @th_index_dump_count = 0

    # Compose the location of the thread head index file
    if @thread_head_index.nil?
      @thread_head_index = "#{File.expand_path(@src_mail_dir)}.mail_sort.idx"
    end

    # Pre-run debugging
    if @debug > 0
      $stderr.puts "@debug: '#{@debug}'"
      $stderr.puts "@src_mail_dir: '#{@src_mail_dir}'"
      $stderr.puts "@thread_head_index: '#{@thread_head_index}'"
      $stderr.puts "@days_ago: '#{@days_ago}'"
      $stderr.puts "@cutoff: '#{@cutoff.to_s}'"
      $stderr.puts "@cutoff_sum: '#{@cutoff_sum.to_s}'"
      $stderr.puts "@th_index_cutoff_sum: '#{@th_index_cutoff_sum}'"
    end

    #
    # Do the run.
    #

    # Load the thread head index if it exists.
    load_thread_head_index(@thread_head_index)

    # The location of the Maildir's cur directory.
    src_mail_dir_cur = File.join(File.expand_path(@src_mail_dir),'cur')

    # Iterate through each file in the Maildir's cur directory.
    Dir.foreach(src_mail_dir_cur) do |filename|
      # Skip . and ..
      next if filename == '.'
      next if filename == '..'

      filename = File.join(src_mail_dir_cur, filename)

      # Skip any directories.
      next unless File.file?(filename)

      $stderr.puts "filename: #{filename}" if @debug > 2

      # Parse the file into an email.
      email = TMail::Mail.parse(IO.read(filename))

      if email['references'].nil? && email['in-reply-to'].nil?
        # This email is a thread head
        if email['message-id'].id.nil?
          # This email is a malformed email.
          $stderr.puts "No message-id for #{filename}" unless @quiet
        else
          # Add this email as a thread head.
          add_thread_head(email)

          # Archive this email.
          archive_thread_head(email, @src_mail_dir, filename)
        end
      else
        # This email is a thread child
        thread_head = nil

        # Determine the thread's head (Simple case first since it is the most
        # common)
        if !email['references'].nil? && !email['references'].ids.empty?
          # This email has a References header and it is not empty
          thread_head = email['references'].ids.first.to_sym
        elsif !email['in-reply-to'].nil? && !email['in-reply-to'].empty?
          # This email only has an In-Reply-To header which is not empty
          thread_head = email['in-reply-to'].to_s.to_sym
        end

        # Lookup the thread head in @thread_heads.
        if @thread_heads.key?(thread_head)
          # Found it, so archive this email in the thread head's archive directory.
          archive_thread_child(thread_head, @src_mail_dir, filename)
        else
          # Possibly an orphaned child. See if any of its other references are
          # known thread heads.
          thread_head = nil
          if email['references'].nil? && !email['in-reply-to'].empty?
            # No References header so use the In-Reply-To header.
            ref = email['in-reply-to'].to_s.to_sym
            thread_head = ref if @thread_heads.key?(ref)
          elsif !email['references'].nil? && !email['references'].empty?
            # Use References header. Iterate through each of the references and
            # use the first that matches as the thread's head.
            email['references'].ids.each do |ref|
              ref = ref.to_s.to_sym
              if @thread_heads.key?(ref)
                thread_head = ref
                break
              end
            end
          end

          # Do we now have the thread's head?
          if thread_head
            # Yes, so archive this email in the thread head's archive directory.
            archive_thread_child(thread_head, @src_mail_dir, filename)
          else
            # No. We have an orphan.
            $stderr.puts "th orphan" if @debug > 2
            @total_orphans += 1

            # Archive it as a thread head.
            add_thread_head(email)
            archive_thread_head(email, @src_mail_dir, filename)
          end
        end
      end

      @total_emails += 1
    end

    # The run is done, so save @thread_heads.
    dump_thread_head_index(@thread_head_index)

    # Post-run debugging.
    if @debug > 0
      $stderr.puts "@thread_heads.length: #{@thread_heads.length}"
      $stderr.puts "@total_orphans: #{@total_orphans}"
      $stderr.puts "@total_emails: #{@total_emails}"
      $stderr.puts "@total_emails_archived: #{@total_emails_archived}"
      $stderr.puts "@th_index_dump_count: #{@th_index_dump_count}"
    end
Invoking it is as simple as ./mail_sort.rb -h.
Ruby Method of the Day - Array.reject!
Signature
array.reject! {|element| block} #=> array or nil
array.reject! {|element| block} does the exact same thing as
Array.delete_if
except that it returns nil if no changes were made to array.
Examples
    a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

    b = a.clone                    #=> [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    b.reject! {true}               #=> []
    b                              #=> []

    b = a.clone                    #=> [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    b.reject! {false}              #=> nil
    b                              #=> [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

    b = a.clone                    #=> [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    b.reject! {|n| n == 3}         #=> [1, 2, 4, 5, 6, 7, 8, 9, 10]
    b                              #=> [1, 2, 4, 5, 6, 7, 8, 9, 10]

    b = a.clone                    #=> [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    b.reject! {|n| n % 2 == 0}     #=> [1, 3, 5, 7, 9]
    b                              #=> [1, 3, 5, 7, 9]
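The nil return value matters most when the result is chained. The short sketch below, with made-up data, shows the difference from delete_if in that situation:

```ruby
a = [1, 3, 5]

# reject! returns nil when the block never matches, so calling a method on
# its result can fail with NoMethodError.
result = a.reject! { |n| n % 2 == 0 }
puts result.inspect                          # nil

# delete_if always returns the array, which makes it safe to chain.
puts a.delete_if { |n| n % 2 == 0 }.length   # 3
```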
Documentation Reference
Ruby version 1.8.6
Ruby Method of the Day - Holiday Break
I’m taking a few weeks off from writing rmotds so I can catch up on some other little pet projects. I’ll start them back up on 2008/01/01.
Ruby Method of the Day - Array.reject
Signature
array.reject {|element| block} #=> new_array
array.reject {|element| block} iterates over
array’s elements and returns new_array that contains any element
in array where the block returns either nil or false.
Examples
    a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

    a.reject {|n| nil}             #=> [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    a.reject {|n| false}           #=> [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    a.reject {|n| true}            #=> []
    a.reject {|n| ''}              #=> []
    a.reject {|n| 0}               #=> []
    a.reject {|n| n == 3}          #=> [1, 2, 4, 5, 6, 7, 8, 9, 10]
    a.reject {|n| n % 2 == 0 }     #=> [1, 3, 5, 7, 9]
    a.reject {|n| true if (n % 3 == 0) || (n % 5 == 0) } #=> [1, 2, 4, 7, 8]
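One way to think about reject is as the complement of select: given the same block, the two results together account for every element. A small illustration with made-up data:

```ruby
a = [1, 2, 3, 4, 5, 6]

evens = a.select { |n| n % 2 == 0 }   # keeps elements where the block is true
odds  = a.reject { |n| n % 2 == 0 }   # keeps elements where the block is false

puts evens.inspect   # [2, 4, 6]
puts odds.inspect    # [1, 3, 5]
puts a.inspect       # [1, 2, 3, 4, 5, 6] (neither call modifies a)
```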
Documentation Reference
Ruby version 1.8.6
Ruby Method of the Day - Array.last
Signature
array.last #=> object or nil
array.last(number) #=> new_array
array.last returns the last element of array or it returns nil if
array is empty. array.last(number) returns the last number elements of
array or it returns an empty array if array is empty.
Examples
    a = ["a", "b", "c", "d", "e", "f"]

    a.last       #=> "f"
    [].last      #=> nil

    a.last(0)    #=> []
    a.last(1)    #=> ["f"]
    a.last(4)    #=> ["c", "d", "e", "f"]
    a.last(99)   #=> ["a", "b", "c", "d", "e", "f"]
    [].last(10)  #=> []
Documentation Reference
Ruby version 1.8.6
Ruby Method of the Day - Array.first
Signature
array.first #=> object or nil
array.first(number) #=> new_array
array.first returns the first element of array or it returns nil if
array is empty. array.first(number) returns the first number elements of
array or it returns an empty array if array is empty.
Examples
    a = ["a", "b", "c", "d", "e", "f"]

    a.first       #=> "a"
    [].first      #=> nil

    a.first(0)    #=> []
    a.first(1)    #=> ["a"]
    a.first(99)   #=> ["a", "b", "c", "d", "e", "f"]
    [].first(10)  #=> []
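Because array.first(number) returns a new array (as does array.last(number)), changing the result leaves the original untouched. A brief sketch with made-up data:

```ruby
a = ['a', 'b', 'c', 'd']

head = a.first(2)
tail = a.last(2)
head << 'z'          # changes head only

puts head.inspect    # ["a", "b", "z"]
puts tail.inspect    # ["c", "d"]
puts a.inspect       # ["a", "b", "c", "d"]
```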
Documentation Reference
Ruby version 1.8.6
Ruby Method of the Day - Array.fetch
Signature
array.fetch(position) #=> object
array.fetch(position) returns the object located at position. If
position is outside array an IndexError is raised. Note that if
position is negative and outside of array then the index number reported in
the exception will be reported as position + array.length. I’ve
submitted a
patch
to fix it. If position is 0 or positive then start counting from the
beginning of the array. If position is negative then start counting from the
end of the array.
array.fetch(position, default) #=> object
array.fetch(position, default) returns the object located at
position. If position is outside of array then default is returned.
If position is 0 or positive then start counting from the beginning of the
array. If position is negative then start counting from the end of the
array.
array.fetch(position) {|position| block} #=> object
array.fetch(position) {|position| block} returns the object
located at position. If position is outside of array then
the block’s result is returned. If position is 0 or
positive then start counting from the beginning of the array. If position is
negative then start counting from the end of the array.
Examples
    a = ['a','b','c','d','e','f']  #=> ["a", "b", "c", "d", "e", "f"]

    a.fetch(2)                     #=> "c"
    a.fetch(-2)                    #=> "e"

    begin
      a.fetch(99)
    rescue IndexError => e
      e.inspect                    #=> "#<IndexError: index 99 out of array>"
    end

    begin
      a.fetch(-99)
    rescue IndexError => e
      # Note that it reports an index of -93 (i.e. -99 + a.length).
      e.inspect                    #=> "#<IndexError: index -93 out of array>"
    end

    begin
      a.fetch(-7)
    rescue IndexError => e
      # Note that it reports an index of -1 (i.e. -7 + a.length).
      e.inspect                    #=> "#<IndexError: index -1 out of array>"
    end

    a.fetch(4, 'z')                #=> "e"
    a.fetch(99, 'z')               #=> "z"

    a.fetch(4){false}              #=> "e"
    a.fetch(99){false}             #=> false

    # Returns the last element for positions past the end of ary and the
    # first element otherwise.
    def default_fetch(ary, position)
      element = nil
      if position > ary.length
        element = ary.last
      else
        element = ary.first
      end
    end

    a.fetch(99){|position| default_fetch(a, position)}   #=> "f"
    a.fetch(4){|position| default_fetch(a, position)}    #=> "e"
    a.fetch(-99){|position| default_fetch(a, position)}  #=> "a"
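It may help to compare fetch with plain indexing. The sketch below uses made-up data: array[position] quietly returns nil for a position outside the array, while fetch raises, which also lets fetch tell a stored nil apart from a missing element.

```ruby
a = ['a', 'b', 'c']

# Plain indexing returns nil for a position outside the array.
puts a[99].inspect                    # nil

# fetch raises IndexError instead.
begin
  a.fetch(99)
rescue IndexError => e
  puts e.class                        # IndexError
end

# A stored nil is returned as is; the default is used only when the
# position is outside the array.
b = ['a', nil]
puts b.fetch(1, 'default').inspect    # nil
puts b.fetch(2, 'default').inspect    # "default"
```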
Documentation Reference
Ruby version 1.8.6
Ruby Method of the Day - Array.delete_if
Signature
array.delete_if {|element| block} #=> array
array.delete_if {|element| block} iterates over all elements of
array and deletes an element from array if block returns true for that
element.
Examples
    a = [1,2,3,4,5,6,7,8,9,10]     #=> [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    a.delete_if {|n| n % 2 == 0}   #=> [1, 3, 5, 7, 9]
    a                              #=> [1, 3, 5, 7, 9]
The following is another usage example.
    #!/usr/bin/env ruby
    #
    # A contrived example of using Array.delete_if
    #

    $VERBOSE = true

    require 'pp'

    # Member of an organization.
    class Member
      attr_accessor :name
      attr_accessor :current

      def current?
        @current
      end
    end

    # Raw member data
    members_data = [
      ['Bob', true],
      ['Fred', false],
      ['Bill', false],
      ['Alice', true]
    ]

    # Create the members based on the raw member data
    members = members_data.map do |entry|
      member = Member.new
      member.name = entry[0]
      member.current = entry[1]
      member
    end

    puts 'Before purging of non-current members...'
    pp members
    puts "\n"

    # Purge non-current members.
    members.delete_if {|member| !member.current?}

    puts 'After purging of non-current members...'
    pp members
Produces…
    Before purging of non-current members...
    [#<Member:0x87780 @current=true, @name="Bob">,
     #<Member:0x877a8 @current=false, @name="Fred">,
     #<Member:0x87794 @current=false, @name="Bill">,
     #<Member:0x8776c @current=true, @name="Alice">]

    After purging of non-current members...
    [#<Member:0x87780 @current=true, @name="Bob">,
     #<Member:0x8776c @current=true, @name="Alice">]
Documentation Reference
Ruby version 1.8.6
Ruby Method of the Day - Array.delete
Signature
array.delete(object) #=> object or nil
array.delete(object) {|o| block } #=> object or returned_block_result
array.delete(object) removes all occurrences of object from array. It
returns object if object is found in array, or nil if it is not. If block
is given and object is not found in array, then the block’s result is
returned by array.delete instead of nil.
Examples
    a = [1,2,3,4,5,6,7]            #=> [1, 2, 3, 4, 5, 6, 7]
    a.delete(7)                    #=> 7
    a                              #=> [1, 2, 3, 4, 5, 6]
    a.delete('foo')                #=> nil

    a = [7,1,2,3,7,4,5,6,7]        #=> [7, 1, 2, 3, 7, 4, 5, 6, 7]
    a.delete(7)                    #=> 7
    a                              #=> [1, 2, 3, 4, 5, 6]

    a.delete('foo') {false}        #=> false

    a.delete('foo') do |item|
      "'#{item}' not found"
    end                            #=> "'foo' not found"
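Since delete removes every occurrence, removing only the first one needs a different approach. One possible way, shown here with made-up data, is to look up the position with Array#index and remove it with Array#delete_at:

```ruby
a = [7, 1, 7, 2, 7]

# Remove only the first 7 by locating its position first.
i = a.index(7)
a.delete_at(i) if i

puts a.inspect   # [1, 7, 2, 7]

# delete removes every remaining 7 in one call.
a.delete(7)
puts a.inspect   # [1, 2]
```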
Documentation Reference
Ruby version 1.8.6
Ruby Method of the Day - Array.compact and Array.compact!
Signature
array.compact #=> new_array
array.compact! #=> array or nil
array.compact returns new_array that contains all non-nil
elements in array (nil elements removed). array.compact! either returns
array with all nil elements removed or returns nil if no nil
elements were removed.
Examples
    a = [1, nil, nil, 4, nil]  #=> [1, nil, nil, 4, nil]
    a.compact                  #=> [1, 4]
    a                          #=> [1, nil, nil, 4, nil]
    a.compact!                 #=> [1, 4]
    a                          #=> [1, 4]
    a.compact!                 #=> nil
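A common place for compact is right after a map whose block returns nil for some elements. The sketch below uses made-up data and also shows why chaining on compact! is risky:

```ruby
words = ['apple', 'kiwi', 'banana']

# map leaves a nil for every word that fails the test; compact drops them.
long = words.map { |w| w.upcase if w.length > 4 }.compact
puts long.inspect             # ["APPLE", "BANANA"]

# compact! returns nil when there was nothing to remove, so avoid calling
# further methods on its result.
clean = ['a', 'b']
puts clean.compact!.inspect   # nil
puts clean.inspect            # ["a", "b"]
```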
Documentation Reference
Ruby version 1.8.6
