A look at csplit
- Patterns and repeat counters
- File naming
- Suppressing the matching lines
- Bug 42764: the last match is not suppressed
- Conclusion
Most Linux users who regularly use the terminal are aware of GNU
Coreutils, an extensive collection of utilities that includes things
like sort, uniq, cut and cat, and that many of us use daily to perform
file manipulation tasks.
Among them is split, a tool that divides a file into smaller parts, each
one stored in an individual file. It can operate in various ways, the
most common being to make each part exactly N bytes large or exactly N
lines long. For example, one can use split to divide a large ISO file
into smaller parts for transmission, and then reassemble the pieces
using cat.
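The split-and-reassemble round trip can be sketched as follows. This is only an illustration under stated assumptions: the names big.txt, part- and rebuilt.txt are made up for the example, and a small text file stands in for the ISO image.

```shell
# Work in a scratch directory so no real files are touched.
cd "$(mktemp -d)"

# Stand-in for a large file: the numbers 1..1000, one per line.
seq 1 1000 > big.txt

# Cut it into pieces of at most 1000 bytes each: part-aa, part-ab, ...
split -b 1000 big.txt part-

# The shell expands part-* in sorted order, so cat restores the original.
cat part-* > rebuilt.txt

# cmp exits with status 0 only when the two files are identical.
cmp big.txt rebuilt.txt && echo "files are identical"
```

The default alphabetic suffixes are what make the plain part-* glob safe here: they sort in the same order in which the pieces were written.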
split has a lesser-known brother: csplit. Unlike split, csplit performs
context-based splitting, meaning that instead of simply splitting a file
after a fixed number of bytes or lines, it looks for specific markers to
act as separators. Such markers are provided as regular expressions,
which are searched for in the input file. Each time a regexp matches,
the program writes everything that comes before the matching line to a
new file. Then it starts again from the matched line and looks for the
next match, repeating the process until the input is exhausted. File
creation and numbering are handled by the tool, just as with split.
Suppose we have a multi-document YAML file:
key1: value1
---
key2: value2
---
key3: value3
We need to split it so that each document goes into its own file (the
lines starting with # are not part of the files; they were added to mark
the beginning of each individual file, a la head -v):
# ==> /tmp/test-000.yaml <==
key1: value1
# ==> /tmp/test-001.yaml <==
key2: value2
# ==> /tmp/test-002.yaml <==
key3: value3
This is a perfect use case for csplit: we want to isolate specific
portions of the file depending on where the YAML document marker ---
appears.
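To follow along, the sample file can be recreated in a scratch directory. This is just a convenience sketch; the use of mktemp and a here-document is my choice, only the file content comes from the example above.

```shell
# Scratch directory, so the pieces csplit creates later are easy to clean up.
cd "$(mktemp -d)"

# The three-document sample: two separator lines, five lines in total.
cat > test.yaml <<'EOF'
key1: value1
---
key2: value2
---
key3: value3
EOF

# Quick sanity check: print the number of lines in the file.
wc -l < test.yaml
```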
Patterns and repeat counters
The general form of the csplit command line is:
csplit [OPTIONS]... FILE PATTERNS...
where FILE is the path to the file we want to split, and PATTERNS are
one or more regular expressions, each used to find the next separator
line. Expressions follow the BRE (Basic Regular Expression) syntax, so
things like + and | must be preceded by a backslash to be recognized as
metacharacters. Also, regexps are always enclosed between a pair of /
characters, so they look like /a.*b/.
In our example, we want to match the line containing only three dashes,
so we would use the expression /^---$/. The beginning-of-line and
end-of-line anchors ensure that lines containing three dashes in the
middle are not mistaken for document separators.
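The effect of the anchors can be previewed with grep, which also uses basic regular expressions by default. This is a side experiment rather than a csplit run: the file anchors.yaml and its content are invented for the illustration, and the surrounding / characters are left out because they are csplit's delimiters, not part of the regexp.

```shell
cd "$(mktemp -d)"

# Hypothetical input: line 2 has three dashes in the middle of some text,
# line 3 is a genuine document separator.
printf '%s\n' 'title: first' 'note: a --- b' '---' 'title: second' > anchors.yaml

# Unanchored pattern: reports both line 2 and line 3.
grep -n -e '---' anchors.yaml

# Anchored pattern: reports only line 3, the real separator.
grep -n -e '^---$' anchors.yaml
```

The -e option is there only because the pattern starts with a dash and would otherwise be taken for a command-line option.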
By default, each regexp is used only once. If we were to run:
csplit test.yaml '/^---$/'
we would end up with two new files, not three: the first would contain
the first document, the second the remaining two. To allow regexps to be
reused, they can be followed by a repeat counter, a positive integer
enclosed in {} that causes the expression to be matched multiple times.
For example, {2} means that the regexp must be matched two more times,
in addition to the single match implied by the regexp itself, so that
the program attempts a total of three matches before moving on to the
next regexp. Each repeat counter applies only to the expression
immediately preceding it. For example:
csplit test.yaml '/^---$/' '{1}'
would split our file correctly, as it splits on two document separators: once for the expression itself and once because of the repetition.
As a special case, one can use an asterisk in place of a number to mean "split as many times as you can". Thus, the following is equivalent to the previous example:
csplit test.yaml '/^---$/' '{*}'
Update: while the previous statement ought to be true according to the
documentation, there is a bug in current csplit versions which causes
asterisk repetition to behave differently from a fixed number when the
--suppress-matched option is used. See below for the details.
csplit also writes some information to standard output: the size in
bytes of each file produced, one per line. So in our case we expect it
to print three integers, the sizes of the three output files.
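The size report is also a quick way to see how many pieces each repeat counter produces. The sketch below recreates the sample file in a scratch directory and captures csplit's output in shell variables; the variable names once, fixed and star are mine. Note that --suppress-matched is not used here, so the bug mentioned in the update does not come into play.

```shell
cd "$(mktemp -d)"
printf '%s\n' 'key1: value1' '---' 'key2: value2' '---' 'key3: value3' > test.yaml

# No repeat counter: one split, two pieces (13 and 34 bytes).
once=$(csplit test.yaml '/^---$/')
echo "$once"

# {1}: two splits, three pieces (13, 17 and 17 bytes).
fixed=$(csplit test.yaml '/^---$/' '{1}')
echo "$fixed"

# {*}: split on every match, which for this file gives the same three pieces.
star=$(csplit test.yaml '/^---$/' '{*}')
echo "$star"

# Cross-check the reported sizes against the files left on disk.
wc -c xx00 xx01 xx02
```

Each run overwrites the xx files of the previous one, so only the pieces from the last command remain. When the numbers are not wanted, the -s option (also spelled --quiet) silences the report.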
File naming
So far so good, but if we look at the filenames csplit constructs for
new files, they don’t tell us much about the original file they come
from. By default, each file name is constructed by appending a 2-digit
decimal counter to the prefix xx, so our files are going to be called
xx00, xx01 and so on. This is hardly useful.
A few options allow us to tweak file naming:
- -f replaces the prefix. Instead of xx we may use test-, so that it is evident which file generated the pieces;
- -b replaces the numeric suffix. It can contain literal text in addition to the counter, and it uses a single printf-style placeholder to specify where the counter should be expanded. In our case, we would like our files to end with a 3-digit counter and the .yaml extension. We could therefore pass this option the value %03u.yaml, which causes suffixes like 000.yaml, 001.yaml and so on to be used;
- -n is a simpler alternative to -b, which changes the width of the numeric suffix but does not allow for additional text and therefore does not require any placeholder.
Let’s try again with -f and -b:
csplit -f 'test-' -b '%03u.yaml' test.yaml '/^---$/' '{*}'
And this is what it produces in the current directory:
test.yaml
test-000.yaml
test-001.yaml
test-002.yaml
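For comparison, here is a sketch of the same split using -n instead of -b. Since -n cannot add literal text, the .yaml extension has to be appended afterwards; the rename loop is my own addition to show the difference, not something csplit does.

```shell
cd "$(mktemp -d)"
printf '%s\n' 'key1: value1' '---' 'key2: value2' '---' 'key3: value3' > test.yaml

# -n 3 widens the counter to three digits, but the names carry no extension.
csplit -f 'test-' -n 3 test.yaml '/^---$/' '{*}' > /dev/null
ls test-*

# Add the extension by hand. test.yaml is not touched, because the
# glob test-* requires a dash right after "test".
for piece in test-*; do
    mv "$piece" "$piece.yaml"
done
ls test-*
```

When an extension is needed, -b gets there in one step; -n is enough when only the counter width matters.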
Suppressing the matching lines
If we now look inside the output files:
# ==> /tmp/test-000.yaml <==
key1: value1
# ==> /tmp/test-001.yaml <==
---
key2: value2
# ==> /tmp/test-002.yaml <==
---
key3: value3
there is still something wrong. The lines matching the regexp were
included at the beginning of the next output file. With the exception of
the first file, all other YAML documents start with ---. This is not
what we wanted.
This is because we told csplit to divide the file at specific lines,
but we never told it that those lines were to be discarded.
Luckily, there is an option that does exactly that, --suppress-matched:
csplit -f 'test-' -b '%03u.yaml' --suppress-matched \
test.yaml '/^---$/' '{*}'
This time the output is correct:
# ==> /tmp/test-000.yaml <==
key1: value1
# ==> /tmp/test-001.yaml <==
key2: value2
# ==> /tmp/test-002.yaml <==
key3: value3
It is not always meaningful to use this option: if we were to split a
Markdown file into sections by looking at lines starting with #, we
would not want the titles to be thrown away.
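A sketch of that Markdown case, where the matched lines must be kept. The file notes.md, its content and the section- prefix are invented for the illustration; the file deliberately starts with a line that is not a heading, so that the first piece holds the preamble.

```shell
cd "$(mktemp -d)"

# Hypothetical Markdown file: a short preamble followed by two sections.
printf '%s\n' 'Some preamble' '# First' 'alpha' '# Second' 'beta' > notes.md

# No --suppress-matched: each heading stays at the top of its own piece.
csplit -f 'section-' -b '%02u.md' notes.md '/^# /' '{*}' > /dev/null

# Show the first line of each section file.
head -n 1 section-01.md section-02.md
```

Here the regexp is /^# / with a trailing space, which matches top-level headings only; a looser pattern would be needed to split on deeper heading levels as well.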
Bug 42764: the last match is not suppressed
Up to and including the current version of csplit at the time of writing
(coreutils v8.32), there is a bug which causes the last match in a file
to be suppressed only when using {*} repetition. Using a fixed number
equal to the expected total number of matches (minus one, for the
regular expression itself) will cause the last segment to contain the
matched line.
Let’s try splitting our sample file using {1} instead of {*}.
Theoretically, they should be equivalent since {1} matches two times,
once because of the regular expression itself and once because of the
counter, and we know the file contains just two marker lines. However,
if we try it:
csplit -f 'test-' -b '%03u.yaml' --suppress-matched \
test.yaml '/^---$/' '{1}'
the output files contain:
# ==> /tmp/test-000.yaml <==
key1: value1
# ==> /tmp/test-001.yaml <==
key2: value2
# ==> /tmp/test-002.yaml <==
---
key3: value3
As you can see, the last part still contains the marker, something that
didn’t happen with {*}.
The good news is that the bug has been reported and will probably be fixed in the next coreutils release. The bad news is that most systems will have to cope with older versions of this package for quite some time, so it is better to be aware of this gotcha.
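When a fixed repeat count cannot be avoided, one possible workaround is to clean up the pieces afterwards. This is my own sketch, not something taken from the bug report: it assumes GNU sed for the -i (in-place) option and simply deletes the first line of each piece if that line is a separator.

```shell
cd "$(mktemp -d)"
printf '%s\n' 'key1: value1' '---' 'key2: value2' '---' 'key3: value3' > test.yaml

# Fixed repeat count: on affected versions the last piece keeps its --- line.
csplit -f 'test-' -b '%03u.yaml' --suppress-matched \
    test.yaml '/^---$/' '{1}' > /dev/null

# Delete line 1 of each piece, but only when it is exactly "---".
# On unaffected versions nothing matches and the files stay unchanged.
for piece in test-[0-9][0-9][0-9].yaml; do
    sed -i -e '1{/^---$/d;}' "$piece"
done

# The last piece now holds only the third document.
cat test-002.yaml
```

Because the sed expression is a no-op on correct output, the same script behaves identically whether or not the installed csplit has the bug.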
Conclusion
csplit can be a little time saver when you need to do exactly what it
was designed to do: split a file into parts using specific lines as
separators.
Although I explained the core features, it provides options and functionality I didn’t mention, so make sure to have a look at its manpage for the full details.