Ramblings on technology with a dash of social commentary
  • How to split a file into 2 chunks in bash

    Posted on May 9th, 2014 by phpguru

    I had a task: take a file with millions of lines and split it into exactly two files, with a fixed number of lines from the top (what `head` would give you) in a part A file and the remaining lines in a part B file.

    It actually turns out to be somewhat non-trivial to split a file into 2 chunks, with a small, defined top portion and a large, arbitrary bottom portion.

    The split [OPTION]... [INPUT [PREFIX]] command is really designed to cut your file into N chunks, or into as many chunks of N lines each as it takes. I debated using it and merging all but the first file back together, but decided to look up sed examples instead.
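
    For comparison, here is a minimal sketch of the split-and-merge route that was considered and set aside. The function name `split_then_merge`, the `chunk_` prefix and the four-argument calling convention are all made up for illustration; `-l` (lines per piece) and `-a` (suffix length) are standard split options. It assumes the input has at least one line, and it writes the remainder twice (once as pieces, once merged), which is part of why it is unattractive for very large files.

```shell
# Hypothetical helper: split_then_merge INPUT N TOPFILE RESTFILE
# Cuts INPUT into pieces of N lines, keeps the first piece as TOPFILE
# and concatenates every other piece into RESTFILE.
split_then_merge() {
    local file=$1 n=$2 top=$3 rest=$4 tmp piece
    tmp=$(mktemp -d) || return 1

    # -a 4 gives suffixes aaaa, aaab, ... so there is room for many pieces
    split -a 4 -l "$n" "$file" "$tmp/chunk_" || { rm -rf "$tmp"; return 1; }

    # The first piece is the top chunk (assumes INPUT was not empty)
    mv "$tmp/chunk_aaaa" "$top" || { rm -rf "$tmp"; return 1; }

    # The remaining pieces, in glob (lexical) order, form the remainder
    : > "$rest"
    for piece in "$tmp"/chunk_*; do
        [ -e "$piece" ] && cat "$piece" >> "$rest"
    done

    rm -rf "$tmp"
}
```

    A call such as `split_then_merge bigfile.txt 1000 top.txt rest.txt` would leave bigfile.txt untouched.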

    I ended up with the function below. There’s almost certainly a faster/better way to achieve it, but this seems to work.
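
    One alternative that is often used for the bottom part is `tail -n +K`, which prints from line K to the end of the file. The sketch below is not the function from this post; `split_top` and its argument order are hypothetical names for illustration, and no claim is made here that it is faster than sed on any particular system.

```shell
# Hypothetical helper: split_top INPUT N TOPFILE RESTFILE
# Writes the first N lines of INPUT to TOPFILE and everything
# from line N+1 onward to RESTFILE. INPUT itself is not modified.
split_top() {
    local file=$1 n=$2 top=$3 rest=$4

    # Top part: the first N lines
    head -n "$n" "$file" > "$top" || return 1

    # Bottom part: start printing at line N+1
    tail -n +"$((n + 1))" "$file" > "$rest"
}
```

    Like the sed version, this reads the input twice, once for each output file.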

    Sample usage

    remainderof bigfile.txt 1000
    When that runs you will have this result

    bigfile_1000.txt // this is the top 1000 lines
    bigfile_1001-N.txt // this is the remainder of the file after splitting off the top 1000 lines (N is the total line count of the original)
    bigfile.txt // the original untouched file
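
    Since the original is left in place in this mode, the split can be sanity-checked by gluing the two pieces back together and comparing the result with the original. A small sketch, where `verify_split` is a hypothetical helper name and not part of the function in this post:

```shell
# Hypothetical helper: verify_split ORIGINAL TOPFILE RESTFILE
# Succeeds (exit status 0) only if TOPFILE followed by RESTFILE
# is byte-for-byte identical to ORIGINAL.
verify_split() {
    local original=$1 top=$2 rest=$3

    # cmp -s is silent; "-" tells cmp to read the first input from stdin
    cat "$top" "$rest" | cmp -s - "$original"
}
```

    Pass the original file, the top file and the remainder file, in that order, and test the exit status, e.g. `verify_split a.txt top.txt rest.txt && echo OK`.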

    Using optional 3rd argument

    remainderof bigfile.txt 1000 true
    When that runs you will have this result

    bigfile_1000.txt // this is the top 1000 lines
    bigfile.txt // this now holds the remainder of the file after splitting off the top 1000 lines
    // the original contents of bigfile.txt no longer exist

    Here’s the function

    alias ll='ls -larth'
    
    function remainderof() {

        # Check the arguments before using them. Use return rather than exit,
        # so a bad call does not close the shell the function was sourced into.
        if [ $# -lt 2 ]
        then
            echo "Usage: remainderof filename_to_split rows_to_chop [opt_bool_process_in_place_destructively]"
            return 1
        fi

        local thefile=$1
        local batchsize=$2
        # Optional 3rd argument: pass true to replace the original with the remainder
        local replaceoriginal=${3:-false}

        local extension="${thefile##*.}"
        local filename="${thefile%.*}"
        # The arithmetic expansion strips any padding wc puts around the count
        local length=$(( $(wc -l < "$thefile") ))
        local buffer=$((length - batchsize))
        echo "Splitting $batchsize lines off the top of $filename $extension leaving $buffer from $length ..."

        # The top chunk filename
        local topfile="${filename}_${batchsize}.${extension}"

        # The bottom chunk = remainder of file
        local startofnext=$((batchsize + 1))
        local remainder="${filename}_${startofnext}-${length}.${extension}"
        echo "Writing $topfile and $remainder"

        # Split off the first N lines of the file
        head -n "$batchsize" "$thefile" > "$topfile"

        # Split off the bottom LENGTH - N lines of the file
        sed "1,${batchsize}d" "$thefile" > "$remainder"

        # Whether to replace the original with the remainder
        if [ "$replaceoriginal" = true ] ; then
            mv -f "$remainder" "$thefile"
            # The remainder now lives under the original name
            remainder=$thefile
        fi

        echo $(wc -l < "$topfile") lines in "$topfile"
        echo $(wc -l < "$remainder") lines in "$remainder"

        echo Done
    }
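
    The output names in the function come from two bash parameter expansions, `${thefile##*.}` and `${thefile%.*}`. The snippet below shows what they produce on their own. The sample file names are only illustrations; the last case shows a quirk worth knowing: a name with no dot comes back unchanged from both expansions.

```shell
thefile=bigfile.txt
extension="${thefile##*.}"   # remove the longest leading match of '*.'  -> txt
filename="${thefile%.*}"     # remove the shortest trailing match of '.*' -> bigfile
batchsize=1000
topfile="${filename}_${batchsize}.${extension}"
echo "$topfile"              # prints bigfile_1000.txt

# Only the last dot counts for both expansions
multi=archive.tar.gz
multi_ext="${multi##*.}"     # gz
multi_base="${multi%.*}"     # archive.tar

# No dot at all: neither pattern matches, so both results equal the input
nodot=datafile
nodot_ext="${nodot##*.}"     # datafile
nodot_base="${nodot%.*}"     # datafile
```

    So for a file called datafile, the top chunk built this way would be named datafile_1000.datafile.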
    

    If you have improvements, leave a comment!