This page is READ-ONLY. It is generated from the old site.
All timestamps are relative to 2013 (when this page is generated).

I am an email killer

terrible
Added by almost 4 years ago

For some unknown reason, the mailbox of Mr. Đ is full of spam messages and duplicate messages. He has about 38,885 email items in the main INBOX. Because of that huge number of messages, the server cannot process any request in the normal way: Mr. Đ can't send email from Thunderbird nor Outlook Express, and in many cases he can't even open the INBOX in Thunderbird. Lol :L

The "duplicate remover" add-on of Thunderbird can't do its job because of "Timeout error".

Today I decided to remove all the duplicate emails via direct access to Mr. Đ's home directory. Because the server doesn't allow me to install anything (an out-of-date reason), I have to write some shell scripts... I'll use md5sum to detect the duplicates. A boring and heavy task.
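The whole idea is that two messages with byte-for-byte identical content produce the same md5 hash, so matching hashes mean duplicate messages. A tiny illustration (the file names here are made up, not from the real mailbox):

$ printf 'same content\n' > msg1
$ printf 'same content\n' > msg2
$ md5sum msg1 msg2   # both lines show the same hash: msg2 is a duplicate of msg1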

First, I generate the md5sum of all messages

$ for f in /home/xxx/Maildir/cur/*; do
  md5sum "$f" >> /home/kyanh/md5sum.txt
done
$ cd /home/kyanh/
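This loop spawns one md5sum process per message, which is quite slow with almost 40,000 files. A faster variant, in the spirit of the find command used in the comment at the bottom of this page, would be something like:

$ find /home/xxx/Maildir/cur -type f -exec md5sum {} + > /home/kyanh/md5sum.txt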

Now count how many files we have

$ wc -l < md5sum.txt
38885

So yes, we really have 38,885 emails. A terrible number!

Second, I extract just the md5 hashes. The file md5sum.txt contains both the hash and the filename.

$ awk '{print $1}' md5sum.txt > only_md5.txt
$ tail -4 only_md5.txt
bef2109c06db25e2ac415dfadce5e901
ac752e4cf747f6c65666385bdf1edb8f
3b159a7fea511b8562d44ed0781e1faa
93830eed85e2c4d002dcbe8b87be9bbc

Third, sort only_md5.txt, remove all duplicated items, and see how many unique items we have

$ sort -u only_md5.txt > sorted.txt
$ wc -l < sorted.txt
3714

Great! Mr. Đ has only about 3,700 unique messages. That is only about 9% of his INBOX. Terrible!
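Not part of the procedure, but if you are curious which messages are duplicated the most, a quick one-liner like this shows the top offenders (count per hash, biggest first):

$ sort only_md5.txt | uniq -c | sort -rn | head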

Fourth, gather all duplicate emails, grouping them into one file per hash

$ mkdir -p o
$ for f in `cat sorted.txt`; do
  grep $f md5sum.txt > o/$f
done
$ cat o/001a5bd478afb34968f2e22c70c4babc 
001a5bd478afb34968f2e22c70c4babc  /home/xxx/Maildir/cur/1237540106.15663_2.xxx:2,S
001a5bd478afb34968f2e22c70c4babc  /home/xxx/Maildir/cur/1239765398.P3584Q278M821587.xxx:2,S
001a5bd478afb34968f2e22c70c4babc  /home/xxx/Maildir/cur/1239765463.P3893Q278M676068.xxx:2,S
001a5bd478afb34968f2e22c70c4babc  /home/xxx/Maildir/cur/1239765528.P3923Q278M411317.xxx:2,S
001a5bd478afb34968f2e22c70c4babc  /home/xxx/Maildir/cur/1239765595.P3933Q278M109407.xxx:2,S
001a5bd478afb34968f2e22c70c4babc  /home/xxx/Maildir/cur/1239765657.P3964Q278M456950.xxx:2,S

As you can see, each of these files lists all the messages that share the same md5sum.
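For what it's worth, the grouping and the selection of files to delete could also be done in a single awk pass. This is just an alternative sketch, not what I actually ran:

# print the path of every 2nd, 3rd,... occurrence of each hash
$ awk 'seen[$1]++ { print $2 }' md5sum.txt > to_delete.txt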

Fifth, the script that removes the duplicate messages. It is very crude because I had to write it as fast as possible before the SSH session closed.

#!/bin/bash

_REAL=$1

src="`pwd`/o"

#
# print the second field of a line like
# 001a5bd478afb34968f2e22c70c4babc  /home/xxx/Maildir/cur/...
#
ff() {
  awk '{print $2}'
}

#
# keep the file listed on the first line,
# delete the files listed on the second, third,... lines
#
xfile() {
  file=$1
  ok="$2"
  _count="`wc -l < $file`"
  if test $_count -ge 2; then
    _keep="`head -1 $file | ff`"
    _del="`sed -e 1d $file | ff`"
    echo "keep : $_keep"
    if test "x$ok" == "xOK"; then
      rm -fv $_del
    else
      echo $_del
    fi
  fi
}

if test "x$_REAL" == "xOK"; then
  # real run: process every group and delete the duplicates
  for f in $src/*; do
    echo "______________"
    xfile $f OK
  done
else
  # dry run: only show a few groups as a test
  for f in `ls $src/00*|head -4`; do
    echo "______________"
    xfile $f
  done
fi

It's easy to use this script

$ sh xx.sh    # dry run, for testing
$ sh xx.sh OK # really delete, after making a backup with rsync
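The backup mentioned above is just a plain rsync of the whole Maildir; something along these lines (the destination path is only an example, not from the real setup):

$ rsync -a /home/xxx/Maildir/ /home/kyanh/Maildir.backup/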

The whole story took me about 4 hours (including the time to bypass the ISP's blocking), all done remotely :)


Comments

Added by almost 4 years ago

The script was executed again:

$ cd /root/
$ ln -s /home/xxx/Maildir/.Trash.DEF/cur xxx
$ wc -l < md5sum.txt
44036
$ wc -l < sorted.txt
1546

Added by almost 4 years ago

Before deleting: 21 GB of data
After deleting: 370 MB of data

Added by almost 4 years ago

The complete script for the whole job described above. The compare-and-delete script from the first post must be saved under the name xx.sh.

#!/bin/bash

cd $HOME

# hash every message file (xxx is the symlink to the mail folder)
echo '' > md5sum.txt
find $HOME/xxx/ -type f -exec md5sum {} \; > md5sum.txt

# extract the hashes and keep one copy of each
awk '{print $1}' md5sum.txt > only_md5.txt
sort -u only_md5.txt > sorted.txt

# group the duplicates into one file per hash
mkdir -p $HOME/o
rm -f $HOME/o/*

for f in `cat sorted.txt`; do
  grep $f md5sum.txt > $HOME/o/$f
done

# compare and delete (the script from the post above)
./xx.sh OK
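A possible sanity check after the run (assuming the same xxx symlink as in the comment above) is simply to count what is left:

$ ls $HOME/xxx | wc -l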