Reading BZip2 Files in Clojure

BZip2-compressed files can easily be read in Clojure thanks to Apache Commons Compress. Sample code inside!

Add an Apache Commons Compress dependency to your Leiningen project.clj file like so:

(defproject bz2reader "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.3.0"]
                 ;; Read/write compressed files (BZip2, etc.)
                 [org.apache.commons/commons-compress "1.4"]])

Then, you can make a BZip2 capable reader-generating function as follows:

(ns bz2reader.util
  (:require [clojure.java.io :as io])
  (:import (org.apache.commons.compress.compressors.bzip2 BZip2CompressorInputStream)))


(defn bz2-reader
  "Returns a streaming Reader for the given compressed BZip2
  file. Use within (with-open)."
  [filename]
  (-> filename io/file io/input-stream BZip2CompressorInputStream. io/reader))

This is based on a line from the fs utilities project, which includes a function to decompress a file on disk. Rather than sending the output of the BZip2 stream to a copy function that writes it to disk, we simply return the reader for the caller to use. You can use the BZip2-enabled reader with any of the normal Clojure I/O functions, just like any other reader.

Here’s an example of how you can print the contents of a BZip2-compressed file to stdout using the above function:

(defn print-bz2-file [filename]
  (with-open [rdr (bz2-reader filename)]
    (doseq [line (line-seq rdr)]
      (println line))))
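One thing to keep in mind is that line-seq is lazy, so the lines have to be consumed before with-open closes the reader. Here is a minimal sketch of that pattern; count-matching and its open-reader argument are names invented for this example, and it assumes you pass in a zero-argument function that opens a reader (such as one wrapping bz2-reader).

```clojure
(ns bz2reader.lines
  (:require [clojure.java.io :as io]))

(defn count-matching
  "Counts the lines that match the regex re. open-reader is a
  zero-argument function returning a BufferedReader (hypothetical
  helper argument for this sketch). count forces the lazy line-seq
  inside with-open, before the reader is closed."
  [open-reader re]
  (with-open [rdr (open-reader)]
    (count (filter #(re-find re %) (line-seq rdr)))))
```

With the reader from this post, a call might look like (count-matching #(bz2-reader "data.bz2") #"ERROR"), where the file name and pattern are placeholders.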

It’s easy to create similar readers for GZip files, zip files, and so on using the other Apache Commons Compress classes. You can also write BZip2-compressed files in the same way, by wrapping your regular OutputStream in a BZip2CompressorOutputStream.
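As a rough sketch of both ideas, assuming the same commons-compress dependency as above: gz-reader and bz2-writer are names made up for this example, not functions from any library. The classes they wrap, GzipCompressorInputStream and BZip2CompressorOutputStream, are part of Commons Compress.

```clojure
(ns bz2reader.more
  (:require [clojure.java.io :as io])
  (:import (org.apache.commons.compress.compressors.gzip GzipCompressorInputStream)
           (org.apache.commons.compress.compressors.bzip2 BZip2CompressorOutputStream)))

(defn gz-reader
  "Returns a streaming Reader for the given GZip file.
  Use within with-open. (Hypothetical helper for this sketch.)"
  [filename]
  (-> filename io/file io/input-stream GzipCompressorInputStream. io/reader))

(defn bz2-writer
  "Returns a Writer that BZip2-compresses everything written to it.
  Use within with-open: closing the Writer finishes and closes the
  underlying compressed stream. (Hypothetical helper for this sketch.)"
  [filename]
  (-> filename io/file io/output-stream BZip2CompressorOutputStream. io/writer))

(defn write-lines-bz2
  "Writes each string in lines, followed by a newline, to a BZip2 file."
  [filename lines]
  (with-open [w (bz2-writer filename)]
    (doseq [line lines]
      (.write w (str line "\n")))))
```

For instance, (write-lines-bz2 "out.txt.bz2" ["hello" "world"]) should produce a file that the bz2-reader above can read back line by line.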

3 Responses to Reading BZip2 Files in Clojure

  1. Have you tested this with larger files? It isn’t working for me:


    (count (line-seq (io/reader "data.xml"))) ; 115498403
    (count (line-seq (bz2-reader "data.xml.bz2"))) ; 49

    (`data.xml` is 3.0 GB. `data.xml.bz2` is 446 MB. These are Wiktionary files, by the way.)

      • This fix worked for me. Note the use of the second form of the BZip2CompressorInputStream constructor, where decompressConcatenated is set to true:

        (defn bz2-reader
          "Returns a streaming Reader for a compressed bzip2 file."
          [filename]
          (-> filename
              io/file
              io/input-stream
              (BZip2CompressorInputStream. true)
              io/reader))
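A likely explanation for the short count is that a .bz2 file may hold several compressed streams back to back, and the one-argument constructor stops after the first of them. The sketch below tries to reproduce that in memory by concatenating two small BZip2 streams and counting lines with the flag off and on. All function names here (bz2-bytes, concat-bytes, count-lines) are invented for this example, and whether the Wiktionary dump above is actually laid out this way is an assumption.

```clojure
(ns bz2reader.concat-demo
  (:require [clojure.java.io :as io])
  (:import (java.io ByteArrayInputStream ByteArrayOutputStream)
           (org.apache.commons.compress.compressors.bzip2 BZip2CompressorInputStream
                                                          BZip2CompressorOutputStream)))

(defn bz2-bytes
  "Compresses the string s into one standalone BZip2 stream (a byte array)."
  [^String s]
  (let [baos (ByteArrayOutputStream.)
        data (.getBytes s "UTF-8")]
    (with-open [out (BZip2CompressorOutputStream. baos)]
      (.write out data 0 (alength data)))
    (.toByteArray baos)))

(defn concat-bytes
  "Joins two byte arrays into one."
  [^bytes a ^bytes b]
  (let [baos (ByteArrayOutputStream.)]
    (.write baos a 0 (alength a))
    (.write baos b 0 (alength b))
    (.toByteArray baos)))

(defn count-lines
  "Counts the lines decoded from data. concatenated? is passed to the
  constructor as its decompressConcatenated argument."
  [^bytes data concatenated?]
  (with-open [rdr (io/reader (BZip2CompressorInputStream.
                               (ByteArrayInputStream. data)
                               (boolean concatenated?)))]
    (count (line-seq rdr))))

;; Two independent BZip2 streams, back to back: 2 lines, then 3 lines.
(def two-streams
  (concat-bytes (bz2-bytes "a\nb\n") (bz2-bytes "c\nd\ne\n")))

(count-lines two-streams false) ; => 2, only the first stream is read
(count-lines two-streams true)  ; => 5, both streams are read
```

If a file contains only a single stream, both settings should give the same result, so passing true is a reasonable default when you do not know how the file was produced.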