Difference in Methodologies.

by debianjoe

The other day, in a random conversation, I showed off a little C-based implementation that did a simple hex-to-decimal conversion.  It wasn’t a big program, but within minutes of posting it, there was the expected “bash response,” which is where someone takes something that’s been posted and redoes it in as few characters as possible in a totally unrelated language.  In this case, it was bash.
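The original program isn’t reproduced here, so take the following as a rough sketch of what such a converter might look like in C; the structure and names are my assumptions, not the posted code.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

/* hex2dec.c - minimal sketch of a hex-to-decimal converter */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s HEXVALUE\n", argv[0]);
        return 1;
    }
    /* strtoul with base 16 accepts input with or without a 0x prefix */
    unsigned long value = strtoul(argv[1], NULL, 16);
    printf("%lu\n", value);
    return 0;
}

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++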

The whole thing sat wrong with me, because I was simply sharing something neat, though I suppose the “but you can do it in just two lines of bash/sh” could have been intended simply as a way of showing another approach.  That’s all beside the point I want to make, but it leads back to why I started learning traditional C in the first place.  C is not at all the easiest language to write in, but what it lacks in ease of writing, it makes up for in performance and in leaving a minimal system footprint when implemented correctly.  When writing for Unix-based systems, the elegance of the layers of your programming can make a huge difference.
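The reply itself isn’t preserved here, but a two-line bash version in that spirit is easy to imagine; treat this as a reconstruction, not the actual response:

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

#!/bin/bash
printf '%d\n' "0x$1"    # e.g. ./hex2dec.sh 1a prints 26

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++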

To really make my point, we need some way to test performance on solid ground.  Luckily, Dennis Williamson wrote a shell script that is a pure bash implementation of hexdump.  I’ll share it here so that you can recreate my example if you wish.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

#!/bin/bash
# bash-hexdump
# by Dennis Williamson - 2010-01-04
# usage: bash-hexdump file

if [[ -z "$1" ]]
then
    exec 3<&0                           # read stdin
    [[ -p /dev/stdin ]] || tty="yes"    # no pipe
else
    exec 3<"$1"            # read file
fi

# if the script name contains "stream" then output will be continuous hex digits
# like hexdump -ve '1/1 "%.2x"'
[[ $0 =~ stream ]] && nostream=false || nostream=true

saveIFS="$IFS"
IFS=""                     # disables interpretation of \t, \n and space
saveLANG="$LANG"
LANG=C                     # allows characters > 0x7F
bytecount=0
valcount=0
$nostream && printf "%08x  " $bytecount
while read -s -u 3 -d '' -r -n 1 char    # -d '' allows newlines, -r allows \
do
    ((bytecount++))
    printf -v val "%02x" "'$char"    # a leading ' makes printf use the character's numeric value
    [[ "$tty" == "yes" && "$val" == "04" ]] && break    # exit on ^D
    echo -n "$val"
    $nostream && echo -n " "
    ((valcount++))
    if [[ "$val" < 20 || "$val" > 7e ]]
    then
        string+="."                  # show unprintable characters as a dot
    else
        string+=$char
    fi
    if $nostream && (( bytecount % 8 == 0 ))      # add a space down the middle
    then
        echo -n " "
    fi
    if (( bytecount % 16 == 0 ))   # print 16 values per line
    then
        $nostream && echo "|$string|"
        string=''
        valcount=0
        $nostream && printf "%08x  " $bytecount
    fi
done

if [[ "$string" != "" ]]            # if the last line wasn't full, pad it out
then
    length=${#string}
    if (( length > 7 ))
    then
        ((length--))
    fi
    (( length += (16 - valcount) * 3 + 4))
    $nostream && printf "%${length}s\n" "|$string|"
    $nostream && printf "%08x  " $bytecount
fi
$nostream && echo

LANG="$saveLANG";
IFS="$saveIFS"

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

This gives us a nice way to pit a pure bash implementation of a relatively involved process against a C implementation.  Now we need some tests, so let’s time the scripted version against the original C hexdump.  I’ll be using the trusty ThinkPad T43 that I use most of the time at home.

time ./hexdump.sh /bin/bash > /dev/null

    real    1m41.563s
    user    1m33.008s
    sys     0m3.541s

time hexdump /bin/bash > /dev/null

    real    0m0.326s
    user    0m0.326s
    sys     0m0.001s

Now, there is probably a way to optimize Dennis’s script for performance.  Bash also doesn’t handle binary data very well because of the way it reads characters (scanning each one for nulls), but even so, the difference between the two implementations should be blatantly obvious.  Bash is a fantastic way to perform simple operations, and it is especially useful as a user interface, but for more complex operations it is slow.  If you’re chaining programs together (which is how the entire Unix ecosystem is designed to work), then using the fastest option at each possible step only makes sense.
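To see what that means in practice, here is the same dump feeding a downstream consumer; a pipeline can only run as fast as its slowest stage.  These commands are illustrative, and the timings will vary by machine.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

# the slow producer throttles everything after it in the pipe
time ./hexdump.sh /bin/bash | wc -c

# the C producer lets the rest of the pipeline run at full speed
time hexdump /bin/bash | wc -c

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++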

We jokingly refer to bloat at the LinuxBBQ, for absolutely everything from RAM usage to how big someone’s vehicle may be, but this is one of the few points where I think a little introspection could do us some good.  Bloat in coding is fine for some things in userspace, but if one language can complete the same task in roughly 0.3% of the time of the other (0.326s against 101.563s above), then it is certainly preferable in any nested situation.
