How to split a file into 2 chunks in bash
Posted on May 9th, 2014
I have a task to take a file with millions of lines and split it into exactly two files: part A, a top portion of a controlled size (what `head` would give you), and part B, the remainder of the lines.
It actually turns out to be somewhat non-trivial to split a file into 2 chunks, with a small defined top portion and a large arbitrary bottom portion.
The `split [OPTION]... [INPUT [PREFIX]]` command is really designed to make N chunks out of your file, or M chunks each containing N lines; I debated using it and merging all but the first file back together, but decided to look up sed examples instead.
I ended up with the function below. There’s almost certainly a faster or better way to achieve it, but this seems to work.
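For comparison, the same two-way split can also be sketched with `head` and `tail` alone, because `tail -n +K` prints a file starting at line K. This is only a minimal sketch, not the function from this post: the name `split_top` and its explicit output-file arguments are made up for illustration.

```shell
# Hypothetical helper (not the post's function): split_top FILE N TOPFILE RESTFILE
# Writes the first N lines of FILE to TOPFILE and all later lines to RESTFILE.
split_top() {
    if [ "$#" -ne 4 ]; then
        echo "Usage: split_top file n topfile restfile" >&2
        return 1
    fi
    local file n top rest
    file=$1; n=$2; top=$3; rest=$4

    # First N lines
    head -n "$n" "$file" > "$top" || return 1
    # tail -n +K starts output at line K, so K = N+1 skips the first N lines
    tail -n "+$((n + 1))" "$file" > "$rest"
}
```

Something like `split_top bigfile.txt 1000 top.txt rest.txt` would leave bigfile.txt untouched, and `cat top.txt rest.txt` should reproduce it.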
Sample usage:
remainderof bigfile.txt 1000
When that runs you will have this result:
bigfile_1000.txt // this is the top 1000 lines
bigfile_1001-N.txt // this is the remainder of the file after splitting off the top 1000 lines (N is the total line count of bigfile.txt)
bigfile.txt // the original untouched file
Using the optional 3rd argument:
remainderof bigfile.txt 1000 true
When that runs you will have this result:
bigfile_1000.txt // this is the top 1000 lines
bigfile.txt // this is the remainder of the file after splitting off the top 1000 lines
// the original file no longer exists; the remainder has taken its name
Here’s the function:
alias ll='ls -larth'

function remainderof() {
    # Check arguments first; use return (not exit) so a bad call does not close the shell
    if [ $# -lt 2 ]; then
        echo "Usage: remainderof filename_to_split rows_to_chop [opt_bool_process_in_place_destructively]"
        return 1
    fi
    local thefile=$1
    local batchsize=$2
    # Optional 3rd argument: pass "true" to replace the original with the remainder
    local replaceoriginal=${3:-false}

    local extension="${thefile##*.}"
    local filename="${thefile%.*}"
    local length=$(wc -l < "$thefile")
    local buffer=$((length - batchsize))
    echo "Splitting $batchsize lines off the top of $filename $extension leaving $buffer from $length ..."

    # The top chunk filename
    local topfile="${filename}_${batchsize}.${extension}"
    # The bottom chunk = remainder of file
    local startofnext=$((batchsize + 1))
    local remainder="${filename}_${startofnext}-${length}.${extension}"
    echo "Writing $topfile and $remainder"

    # Split off the first N lines of the file
    head -n "$batchsize" "$thefile" > "$topfile"
    # Split off the bottom LENGTH - N lines of the file
    sed "1,${batchsize}d" "$thefile" > "$remainder"

    # Whether to leave a copy of the original
    if [ "$replaceoriginal" = true ]; then
        rm -f "$thefile"
        mv "$remainder" "$thefile"
        # The remainder now lives under the original name
        remainder=$thefile
    fi

    echo "$(wc -l < "$topfile") lines in $topfile"
    echo "$(wc -l < "$remainder") lines in $remainder"
    echo Done
}
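On the "faster/better way" question: the function above reads the input several times (once each for `wc`, `head` and `sed`). A single-pass variant can be sketched with awk, which routes each line to one of two files based on its line number. This is a sketch under assumptions, not a benchmarked replacement: the name `split_once` and its explicit output-file arguments are hypothetical, and awk's `-v` interprets backslash escapes, so file names containing backslashes would need extra care.

```shell
# Hypothetical single-pass helper: split_once FILE N TOPFILE RESTFILE
split_once() {
    if [ "$#" -ne 4 ]; then
        echo "Usage: split_once file n topfile restfile" >&2
        return 1
    fi
    local file n top rest
    file=$1; n=$2; top=$3; rest=$4

    # Create both outputs up front so an empty part still produces a file
    : > "$top"
    : > "$rest"

    # One read of the input: lines 1..n go to top, all later lines go to rest
    awk -v n="$n" -v top="$top" -v rest="$rest" '
        NR <= n { print > top; next }
                { print > rest }
    ' "$file"
}
```

Called as `split_once bigfile.txt 1000 top.txt rest.txt`, it should produce the same two parts as the head/sed approach while reading the input once.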
If you have improvements, leave a comment!