python script to inventorise your ab1 files with md5sums

November 7, 2008 at 11:35 am (opensource, tips)

The problem that needed solving this time: getting a list of my ab1 files with their filenames, locations (directory paths) and md5sums, so I know whether duplicate filenames are actually the same file or just the result of misnaming.

I managed to come up with this after borrowing from two different scripts: one that was used to make an inventory of a directory of ogg songs, and the other a Python equivalent of the Linux md5sum check.

Have fun!

#!/usr/bin/python
#===============================================================================
#
#         FILE:  inventory-abi.py
#
#        USAGE:  ./inventory-abi.py
#
#  DESCRIPTION:  Lists all files with the .ab1 extension, along with their directory and md5sum.
#                Adapted from the recursion code at http://pthree.org/2007/08/09/recursion-in-python/
#                and the md5sum code from http://code.activestate.com/recipes/266486/
#      OPTIONS:  ---
# REQUIREMENTS:  ---
#         BUGS:  will execute md5 on directory as well
#                current method to get CWD is not OS independent
#        NOTES:  ---
#       AUTHOR:  Kevin ,
#      VERSION:  1.0
#      CREATED:  11/07/2008 07:03:16 PM SGT
#     REVISION:  ---
#===============================================================================

import dircache, os, sys, md5  # sys is needed for the '-' (stdin) option in md5sum()
counter = 0

def sumfile(fobj):
    '''Returns an md5 hash for an object with read() method.'''
    m = md5.new()
    while True:
        d = fobj.read(8096)
        if not d:
            break
        m.update(d)
    return m.hexdigest()

def md5sum(fname):
    '''Returns an md5 hash for file fname, or stdin if fname is "-".'''
    if fname == '-':
        ret = sumfile(sys.stdin)
    else:
        try:
            f = open(fname, 'rb')
        except IOError:
            return 'Failed to open file'
        ret = sumfile(f)
        f.close()
    return ret

def PrintFiles(indent):
    global counter
    thisDir = os.getcwd()

    for file in dircache.listdir(thisDir):
        if (file.endswith('ab1') or os.path.isdir(file)) and not file.startswith('.'):
            if file.endswith('ab1'):
                counter += 1

            currdir = os.popen("pwd") # get the cwd via the shell; Linux-only for now, pending an OS-independent upgrade
            checksum = md5sum(file) # calls the md5sum function above; the md5 module ships with Python

            ab1File.write('%s%s\t%s\t%s\n' % (indent, file, currdir.readline()[:-1], checksum))

            if os.path.isdir(file):
                os.chdir(file)
                PrintFiles(indent + '  ')
                os.chdir('../')

try:
    ab1File = open('ab1files.txt', 'w')
except IOError, e:
    print "Unable to open 'ab1files.txt' for writing: ", e
else:
    PrintFiles('')
    ab1File.write('\nCurrent number of ab1 files: %d\n\n' %(counter))
    ab1File.close()
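
Once ab1files.txt has been written, the duplicates can be picked out of it directly. A minimal sketch (same Python 2 era as the script above), assuming the tab-separated filename/path/md5 lines that PrintFiles writes; the find_duplicates name is just for illustration:

import sys

def find_duplicates(inventory='ab1files.txt'):
    '''Group the inventory lines by md5 and report checksums seen more than once.'''
    by_md5 = {}
    for line in open(inventory):
        parts = line.rstrip('\n').split('\t')
        if len(parts) != 3:
            continue  # skip blank lines and the trailing summary line
        name, path, checksum = parts
        if not name.strip().endswith('ab1'):
            continue  # directories are also listed in the inventory (see BUGS above)
        by_md5.setdefault(checksum, []).append((name.strip(), path))
    for checksum, entries in by_md5.items():
        if len(entries) > 1:
            print checksum
            for name, path in entries:
                print '    %s (%s)' % (name, path)

if __name__ == '__main__':
    find_duplicates(sys.argv[1] if len(sys.argv) > 1 else 'ab1files.txt')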

4 Comments

  1. Jason Creighton said,

    This seems like a lot of work for something fairly simple. I would tend to use the shell for something like this. On a Linux/Unix box, you can do:

    /tmp/test$ find -name '*.ab1' -exec md5sum {} +
    2b00042f7481c7b056c4b410d28f33cf  ./subdir1/test2.ab1
    c157a79031e1c40f85931829bc5fc552  ./subdir2/test4.ab1
    d3b07384d113edec49eaa6238ad5ff00  ./subdir2/test3.ab1
    b1946ac92492d2347c6235b4d2611184  ./test1.ab1
    /tmp/test$ 
    

    This had pretty much the same effect, although it’s not indented like the output of your script.

    • aboulia said,

      Hi,
      Thanks, that's certainly true... and I found that Python's recursion is really slow.
      Maybe I am doing it wrong.

      Possible applications of this are:
      1) I can do it on my colleague's winxp box
      2) I can do other stuff besides just getting the md5 hash (maybe insert the info into a sqlite db; see the sketch below)
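
      A rough sketch of point 2, assuming Python 2.5+ so the bundled sqlite3 and hashlib modules are available (the ab1files.db database and ab1_files table names are just placeholders):

      import os
      import sqlite3
      import hashlib

      def md5sum(path, blocksize=8192):
          '''Return the md5 hex digest of a file, read in blocks.'''
          h = hashlib.md5()
          f = open(path, 'rb')
          try:
              block = f.read(blocksize)
              while block:
                  h.update(block)
                  block = f.read(blocksize)
          finally:
              f.close()
          return h.hexdigest()

      conn = sqlite3.connect('ab1files.db')
      conn.execute('CREATE TABLE IF NOT EXISTS ab1_files '
                   '(filename TEXT, directory TEXT, md5 TEXT)')

      # walk the tree and record one row per .ab1 file
      for dirpath, dirnames, filenames in os.walk(os.getcwd()):
          for filename in filenames:
              if filename.endswith('.ab1'):
                  full = os.path.join(dirpath, filename)
                  conn.execute('INSERT INTO ab1_files VALUES (?, ?, ?)',
                               (filename, dirpath, md5sum(full)))

      conn.commit()
      conn.close()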

      • Jason Creighton said,

        Well, in fact you can’t do this on your colleague’s Windows XP box, because you call out to “pwd” to get the current directory for some reason, which does not work on Windows. (Why not just use os.getcwd()?) Also, that causes you to spawn a new process on every file you visit, which may be why you’re finding the recursion to be a little slow.

        And you’re walking the directory tree manually when there’s a nice helper for that, os.walk. Here’s a simpler version:

        from __future__ import with_statement
        
        import os
        import hashlib
        
        def md5sum(filename):
            h = hashlib.md5()
            with open(filename, 'rb') as f:
                while True:
                    block = f.read(4096)
                    if not block:
                        break
        
                    h.update(block)
        
            return h.hexdigest()
        
        for dirpath, dirnames, filenames in os.walk(os.getcwd()):
            for filename in filenames:
                if filename.lower().endswith('.ab1'):
                    abs_path = os.path.join(dirpath, filename)
                    print "%s %s" % (md5sum(abs_path), abs_path)
        

        Again, no indenting, but that could easily be added by looking at the number of path components in the current filename.
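
        For example, a quick sketch of that indenting, assuming the walk starts from os.getcwd() as above (combine it with the md5sum() function to get the full listing):

        import os

        top = os.getcwd()
        for dirpath, dirnames, filenames in os.walk(top):
            # depth = number of path separators beyond the starting directory
            depth = dirpath[len(top):].count(os.sep)
            indent = '  ' * depth
            for filename in filenames:
                if filename.lower().endswith('.ab1'):
                    print '%s%s' % (indent, filename)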

      • Kevin Lam said,

        Thanks for the upgrade to my code!
