This page is READ-ONLY. It is generated from the old site.
All timestamps are relative to 2013 (when this page is generated).

I am an email killer

Added by almost 4 years ago

For reasons unknown, Mr. Đ's mailbox is full of spam and duplicate messages. He has about 38885 emails in the main INBOX. With that many messages, the server cannot process requests normally: Mr. Đ can't send email from either Thunderbird or Outlook Express, and in many cases he can't even open the INBOX in Thunderbird. Lol :L

Thunderbird's "duplicate remover" add-on can't do its job either: it fails with a "Timeout error".

Today I decided to remove all the duplicate emails by working directly in Mr. Đ's home directory. Because the server doesn't allow me to install anything (it is badly out of date), I have to write some shell scripts... I'll use md5sum to check for duplicates. A boring and heavy task.

First, I generate the md5sum of all messages

$ for f in /home/xxx/Maildir/cur/*; do
    md5sum "$f" >> /home/kyanh/md5sum.txt
  done
$ cd /home/kyanh/
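
The same list can also be produced with find, which skips anything that is not a regular file and hands many files to each md5sum call instead of starting one process per message. This is only a minimal sketch, assuming a POSIX shell with find and md5sum available; hash_maildir is a made-up helper name, not something from the original script.

```shell
# hash_maildir DIR OUT: write one "checksum  path" line per regular file in DIR.
# hash_maildir is a hypothetical name used only for this illustration.
hash_maildir() {
  dir=$1
  out=$2
  # -type f skips directories; "{} +" batches many files into each md5sum call
  find "$dir" -type f -exec md5sum {} + > "$out"
}

# Usage with the paths from this post:
# hash_maildir /home/xxx/Maildir/cur /home/kyanh/md5sum.txt
```

Keep the output file outside the directory being scanned, otherwise it would be hashed as well.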

Now count how many files we have

$ wc -l < md5sum.txt

Sure enough, we have 38,885 emails. A terrible number!

Second, I extract just the checksums. Each line of md5sum.txt contains both the md5sum and the file name.

$ awk '{print $1}' md5sum.txt > only_md5.txt
$ tail -4 only_md5.txt
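
With the checksum column on its own, it is also easy to see which checksums repeat most often. A small sketch, assuming the standard sort, uniq and head tools; top_duplicates is a hypothetical helper name, and it reads a file in md5sum output format such as md5sum.txt.

```shell
# top_duplicates FILE [N]: print "count checksum" for the N most repeated
# checksums in FILE (md5sum output format). N defaults to 5.
top_duplicates() {
  awk '{print $1}' "$1" |     # keep only the checksum column
    sort |                    # uniq needs sorted input
    uniq -c |                 # count how often each checksum occurs
    sort -rn |                # most frequent first
    head -n "${2:-5}" |
    awk '{print $1, $2}'      # normalise the padding that uniq -c adds
}

# Usage: top_duplicates md5sum.txt 10
```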

Third, sort only_md5.txt, drop the duplicated entries, and count how many items are left

$ sort -u only_md5.txt > sorted.txt
$ wc -l < sorted.txt

Great! Mr. Đ has only about 3700 distinct messages. That is roughly 9.5% of his INBOX. Terrible!
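
The share of distinct messages can be worked out from the two counts. A minimal sketch using awk for the division; unique_share is a hypothetical helper name, and 3700 is the approximate count quoted in the post, so the result is approximate too.

```shell
# unique_share TOTAL UNIQUE: print UNIQUE as a percentage of TOTAL, one decimal
unique_share() {
  awk -v t="$1" -v u="$2" 'BEGIN { printf "%.1f\n", 100 * u / t }'
}

unique_share 38885 3700   # prints 9.5
```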

Fourth, group the file names by checksum and write each group to its own file, named after the checksum, in the directory o

$ mkdir -p o
$ for f in `cat sorted.txt`; do
    grep "$f" md5sum.txt > "o/$f"
  done
$ cat o/001a5bd478afb34968f2e22c70c4babc 
001a5bd478afb34968f2e22c70c4babc  /home/xxx/Maildir/cur/,S
001a5bd478afb34968f2e22c70c4babc  /home/xxx/Maildir/cur/,S
001a5bd478afb34968f2e22c70c4babc  /home/xxx/Maildir/cur/,S
001a5bd478afb34968f2e22c70c4babc  /home/xxx/Maildir/cur/,S
001a5bd478afb34968f2e22c70c4babc  /home/xxx/Maildir/cur/,S
001a5bd478afb34968f2e22c70c4babc  /home/xxx/Maildir/cur/,S

As you can see, each file lists all the messages that share the same md5sum.
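
Running grep once per checksum rescans md5sum.txt for every distinct checksum. The same grouping can be done in a single pass with awk. This is only a sketch of that alternative, not what was actually run: it assumes md5sum's usual "checksum, two separator characters, path" line format and file names without newlines, and list_duplicates is a hypothetical name. It prints every copy except the first one seen for each checksum, that is, the files that would be deleted.

```shell
# list_duplicates FILE: for each checksum in FILE (md5sum output format),
# print the path of every file after the first one seen with that checksum.
list_duplicates() {
  awk '{
    hash = $1
    path = substr($0, length(hash) + 3)   # skip the checksum and the two separator characters
    if (seen[hash]++) print path          # the first occurrence is kept, later ones are listed
  }' "$1"
}

# Usage: list_duplicates md5sum.txt > to_delete.txt
```

Taking the path with substr rather than $2 keeps file names that contain spaces intact.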

Fifth, the script that removes the messages. It is very crude, as I had to write it as fast as possible before the SSH session closed.




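# NOTE: the assignments marked "assumed" are missing from this archived copy
src="$HOME/o"   # assumed: directory holding the per-checksum files
_REAL="$1"      # assumed: first argument; OK means really delete

# print the second field of a line that looks like
# 001a5bd478afb34968f2e22c70c4babc  /home/xxx/Maildir/cur/...
ff() {
 awk '{print $2}'
}

# keep the file in the first line
# delete the files in the second, third,... lines
xfile() {
 file="$1"; ok="$2"   # assumed: list file and confirmation flag
 _count="`wc -l < $file`"
 if test $_count -ge 2; then
  _keep="`head -1 $file | ff`"
  _del="`sed -e 1d $file | ff`"
  echo "keep : $_keep"
  if test "x$ok" = "xOK"; then
   rm -fv $_del
  else
   # test mode: only show what would be deleted
   echo $_del
  fi
 fi
}

if test "x$_REAL" = "xOK"; then
 for f in $src/*;do
  echo "______________"
  xfile $f OK
 done
else
 # test mode: look at the first four lists only
 for f in `ls $src/00*|head -4`;do
  echo "______________"
  xfile $f
 done
fi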

It's easy to use this script

$ sh    # for test
$ sh OK # confirm to delete, after the backup by rsync 

The whole job took me about 4 hours (including the time needed to get around the ISP's blocking), all done remotely :)


Added by almost 4 years ago

script executed again:

$ cd /root/
$ ln -s /home/xxx/Maildir/.Trash.DEF/cur xxx
$ wc -l 44036
$ wc -l 1546

Added by almost 4 years ago

before deleting: 21G of data
after deleting: 370MB of data
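
Numbers like these can be read with du before and after the cleanup. A minimal sketch; dir_size_kb is a hypothetical helper name, and the path in the usage comment is the Maildir from the first post.

```shell
# dir_size_kb DIR: print the disk usage of DIR in kilobytes
dir_size_kb() {
  du -sk "$1" | awk '{print $1}'   # -s: one total for DIR, -k: units of 1024 bytes
}

# Usage: dir_size_kb /home/xxx/Maildir
```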

Added by almost 4 years ago

The complete script that performs the work above. The compare-and-delete script from the first post needs to be saved under the name


cd $HOME

echo '' > md5sum.txt

find $HOME/xxx/ -type f -exec md5sum {} \; > md5sum.txt

awk '{print $1}' md5sum.txt > only_md5.txt

sort -u only_md5.txt > sorted.txt

rm -f $HOME/o/*

for f in `cat sorted.txt`; do
  grep "$f" md5sum.txt > "$HOME/o/$f"
done

./ OK