Yahoo Finance in Clojure

A Clojure version of my Yahoo Finance Ruby gem seemed like an interesting challenge and a good way to learn Clojure better. This version uses Apache HttpClient, which is significantly slower than libcurl. A libcurl version is on the way.

Project requirements and background

We want a fast and efficient way to download thousands of symbols’ worth of historic stock data from Yahoo Finance. Parallel HTTP queries and persistent connections are, therefore, required.

Technomancy’s clojure-http-client is beautifully written idiomatic Clojure, but unfortunately it does not yet support connection pooling. Needless to say, the built-in Java standard library HTTP client doesn’t seem to have it, either. I thought of libcurl, of course, but its Java bindings (and older bindings that the official site says to avoid) do not seem to be particularly complete or maintained. Googling turned up an HTTPClient last updated in 2001, a commercial package from Oakland Software, and the free Apache HttpClient project, which seems to be reasonably active and to do almost everything that the Oakland Software package does. (The comparison on their website seems to have been written in 2003, when they had several features that Apache lacked but that Apache has since implemented.) Oakland Software’s client has pipelining, which none of the others do, but that alone isn’t worth the price.

I went with Apache HttpClient. I figured that it would be easier to use than libcurl’s Java bindings (ha!) as it presented a higher-level user interface. (It turns out that whatever ease-of-use gain there was to be had there is more than compensated for by its confused documentation and general overengineered Java coding style.)

Apache HttpClient’s docs are a little out of date. I found an example of multithreaded parallel HTTP requests and the JavaDocs and went from there.

Clojure code

I began by translating the general outline of the Java example to Clojure, though it pained me to see such imperativeness in parentheses. I made a few updates: I changed it to use the Executor threading system rather than creating threads manually, and of course I used a Clojure function instead of a thread (since all Clojure functions implement Callable, which is really convenient.) I also used Joda-Time to replace the obscene GregorianCalendar/Date classes that come with Java.
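The Callable point is easiest to see in isolation. The following is a minimal sketch, not part of the library: `run-in-pool` is a hypothetical helper name, and the squaring tasks merely stand in for HTTP requests.

```clojure
(import '(java.util.concurrent Executors ExecutorService Future))

(defn run-in-pool
  "Hypothetical helper: runs each zero-argument function in `tasks` on a
   fixed thread pool of `pool-size` threads and returns the results in order."
  [pool-size tasks]
  (let [^ExecutorService pool (Executors/newFixedThreadPool pool-size)]
    (try
      ;; Clojure fns implement java.util.concurrent.Callable, so a collection
      ;; of fns can be passed straight to invokeAll, which blocks until every
      ;; task has finished and returns one Future per task.
      (mapv (fn [^Future f] (.get f)) (.invokeAll pool (vec tasks)))
      (finally
        (.shutdown pool)))))

;; Three trivial tasks standing in for HTTP requests.
(run-in-pool 4 (for [n [1 2 3]] (fn [] (* n n))))
;; => [1 4 9]
```

No wrapper class or proxy is needed; the executor sees each function as an ordinary Callable.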

The latest version of this code is available from GitHub.

;; clojure-yahoo-finance: Clojure interface to Yahoo! Finance historic stock data.
;;
;; Copyright (C) 2010 Paul Legato. All rights reserved.
;; Licensed under the BSD-new license: see the file LICENSE for details.
;;
;; This code is not endorsed by or connected with Yahoo in any way.
;;
(ns clojure-yahoo-finance.core
  (:import (org.apache.http 
            HttpResponse
            HttpVersion
            client.HttpClient
            client.methods.HttpGet
            conn.scheme.PlainSocketFactory
            conn.scheme.Scheme
            conn.params.ConnManagerParams
            conn.params.ConnPerRouteBean
            conn.scheme.SchemeRegistry
            impl.client.DefaultHttpClient
            impl.conn.tsccm.ThreadSafeClientConnManager
            params.BasicHttpParams
            params.HttpConnectionParams
            params.HttpProtocolParams
            protocol.BasicHttpContext
            util.EntityUtils
            )
           (java.util.concurrent Executors)
           (org.joda.time LocalDate)
           ))

(defn- yahoo-request-thread
  "Internal function used as the thread to retrieve historic data for symbol from startDate to endDate via httpClient.

   The value will be either a map with the symbol name and the raw CSV data from Yahoo, or an HTTP status code (e.g. 404 means the symbol was not found.)

   Dates can be of any class that Joda-Time understands (including a 'YYYY-MM-DD' String.)
"
  [httpClient symbol startDate endDate]
   (let [
         start (LocalDate. startDate)
         end (LocalDate. endDate)

         httpGet (new HttpGet 
                      (str "http://itable.finance.yahoo.com" 
                           "/table.csv?s=" symbol "&g=d"
                           "&a=" (- (.getMonthOfYear start) 1)
                           "&b=" (.getDayOfMonth start)
                           "&c=" (.getYear start)
                           "&d=" (- (.getMonthOfYear end) 1)
                           "&e=" (.getDayOfMonth end)
                           "&f=" (.getYear end)))

         ]
     (try
      (let [httpResponse  (. httpClient execute httpGet (new BasicHttpContext))
            entity (. httpResponse getEntity)
            status (.. httpResponse getStatusLine getStatusCode)
            ;; Return the CSV body on success, otherwise the actual HTTP status code.
            data (if (= 200 status)
                   (. EntityUtils toString entity)
                   status)]
        (.consumeContent entity) ;; required to release the connection back to the pool? Docs are unclear on whether EntityUtils does this itself.
        {symbol data}
        )
      (catch Exception e
        (.abort httpGet)
        (throw e)))))



(defn blocking-query
  "Returns historic data between startDate and endDate for the seq of symbols.

Options:
  :maxConnections - default 200.
  :threadPoolSize - default 300. Should be larger than maxConnections
                    for optimal efficiency. Set to 0 for an unbounded thread pool - be
                    advised that doing so may exhaust available memory if you're requesting a lot of symbols
                    relative to the amount of free memory and thread-creation overhead on your system.

The optimal values for these parameters are system-dependent.

"
  [startDate endDate symbols & options]
  (let [opts (when options (apply assoc {} options))

        params (new BasicHttpParams)
        schemeRegistry (new SchemeRegistry)

        results {}

        threadPoolSize (or (:threadPoolSize opts) 300)
        threadPool (if (= threadPoolSize 0) 
                     (Executors/newCachedThreadPool)
                     (Executors/newFixedThreadPool threadPoolSize))
        ]

    ;; Wow, this is ugly. Now I remember why I stopped using Java.
    (. HttpProtocolParams setVersion params HttpVersion/HTTP_1_1)
    (. HttpProtocolParams setUserAgent params "Clojure-Yahoo-Finance")

    (. HttpConnectionParams setTcpNoDelay params true)

    ;; If these are turned too low, then you will get timeouts when running very large queries, because 
    ;; all threads are started at once and each will block until it is allocated an HTTP connection from 
    ;; the pool.
    (. HttpConnectionParams setConnectionTimeout params 100000)
    (. ConnManagerParams setTimeout params 100000)


    ;; These methods are marked deprecated in 4.1-alpha1, but their replacements don't work.
    (. ConnManagerParams setMaxConnectionsPerRoute params (new ConnPerRouteBean (or (:maxConnections opts) 200)))
    (. ConnManagerParams setMaxTotalConnections params (or (:maxConnections opts) 200))
    
    (. schemeRegistry register (new Scheme "http" (. PlainSocketFactory getSocketFactory) 80))
    
    (let [connectionManager (new ThreadSafeClientConnManager params schemeRegistry)
          httpClient (new DefaultHttpClient connectionManager params)
          ]

      ;; Broken in 4.1-alpha1 - need to use the deprecated method above instead
      ;;
      ;;(doto connectionManager
      ;; (.setMaxTotalConnections (or (:maxConnections opts) 200)))
      ;;        (.setDefaultMaxPerRoute (or (:maxConnections opts) 200))) 

      (try
       (let [tasks (map (fn [sym]
                          (fn [] (yahoo-request-thread httpClient sym startDate endDate)))
                        symbols
                        )
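             ;; invokeAll blocks until every task is done; .get unwraps each Future.
             results (map (fn [x] (.get x)) (.invokeAll threadPool tasks))]
         ;; Merge the per-symbol maps into one; the empty-map seed covers an empty symbol list.
         (reduce conj {} results))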
         
         
       (finally
        (.. httpClient getConnectionManager shutdown)
        (.shutdown threadPool)))
      )))
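The query string that yahoo-request-thread builds can be tried out on its own. This is a minimal sketch under two assumptions: `history-query` is a hypothetical name, and `java.time.LocalDate` (Java 8 and later) stands in for Joda-Time so that the snippet needs no extra dependencies.

```clojure
(import '(java.time LocalDate))

(defn history-query
  "Hypothetical helper: builds the path and query string for `sym` between
   two ISO 'YYYY-MM-DD' date strings."
  [sym start-date end-date]
  (let [start (LocalDate/parse start-date)
        end   (LocalDate/parse end-date)]
    (str "/table.csv?s=" sym "&g=d"
         ;; As in the listing above, months are sent zero-based (January = 0),
         ;; while days and years are sent unchanged.
         "&a=" (dec (.getMonthValue start))
         "&b=" (.getDayOfMonth start)
         "&c=" (.getYear start)
         "&d=" (dec (.getMonthValue end))
         "&e=" (.getDayOfMonth end)
         "&f=" (.getYear end))))

(history-query "AAPL" "2009-01-01" "2009-01-31")
;; => "/table.csv?s=AAPL&g=d&a=0&b=1&c=2009&d=0&e=31&f=2009"
```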

Notes

  • The optimal values for :maxConnections and :threadPoolSize depend on how much free RAM is available to the Clojure process, how much memory overhead is involved in creating a new thread (JVM/OS dependent), and the latency characteristics of the network link between the user and Yahoo. The defaults should be reasonable for most modern desktop computers and broadband network environments, but those with access to particularly fast network pipes and with lots of RAM to spare may be able to get better performance by tweaking them. Note that you have to give java the -Xmx flag at startup to actually make more free RAM available to the JVM.
  • Although using ConnManagerParams to set the maximum connections (the setMaxConnectionsPerRoute and setMaxTotalConnections calls) is marked deprecated in the JavaDocs for 4.1-alpha1, the replacements (the commented-out setMaxTotalConnections and setDefaultMaxPerRoute calls on the connection manager) don’t actually do anything, so I had to use the deprecated methods instead. This took me a while to figure out, since of course this isn’t mentioned anywhere in the docs.
  • The docs are inconsistent and vague about how an HTTP connection gets returned to the connection pool. Compounding the problem is that many of the docs seem to have been written for version 3.x, where it was apparently required to manually release connections. One doc mentions that 4.x automatically returns connections to the pool; another says you need to call either abort on the request or consumeContent on the entity (I call consumeContent right after reading the response in yahoo-request-thread, just to be safe.)

No refs

My first version passed a ref to each thread and had the thread update the ref with its own results. That works fine, but I suspected that refs are not the most efficient way to go about this, as they incur transaction overhead in each thread. Profiling bore this out. Downloading one month of AAPL with the refs version took about 0.60 seconds on average. I rewrote it to simply return the result from the thread rather than store it in a ref. The calling function then aggregates the results of all threads with reduce. This version gives us a slight performance boost, averaging around 0.45 seconds. (Subsequent optimizations boosted performance further, to the 0.24 seconds in the table below.)
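The two aggregation styles can be compared side by side. This is a minimal sketch rather than the original refs code, which is not shown here: `fetch`, `run-all`, `collect-with-ref`, and `collect-with-reduce` are hypothetical names, and `fetch` returns canned data instead of making an HTTP request.

```clojure
(import '(java.util.concurrent Executors ExecutorService Future))

(defn fetch
  "Stand-in for the real HTTP request: returns {symbol data}."
  [sym]
  {sym (str "csv-for-" sym)})

(defn run-all
  "Runs zero-argument fns on a small fixed pool and returns their results."
  [tasks]
  (let [^ExecutorService pool (Executors/newFixedThreadPool 4)]
    (try
      (mapv (fn [^Future f] (.get f)) (.invokeAll pool (vec tasks)))
      (finally (.shutdown pool)))))

(defn collect-with-ref
  "Ref style: every worker merges its result into a shared ref, paying for
   one STM transaction per symbol."
  [symbols]
  (let [results (ref {})]
    (run-all (for [sym symbols]
               (fn []
                 ;; Fetch outside the transaction so a retry never refetches.
                 (let [r (fetch sym)]
                   (dosync (alter results merge r))))))
    @results))

(defn collect-with-reduce
  "Return-value style: workers only return; the caller merges once at the end."
  [symbols]
  (reduce merge {} (run-all (for [sym symbols] (fn [] (fetch sym))))))

(collect-with-reduce ["AAPL" "GOOG"])
;; => {"AAPL" "csv-for-AAPL", "GOOG" "csv-for-GOOG"}
```

Both functions return the same map; the second avoids shared mutable state entirely, which is where the saving described above comes from.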

pmap / reduce version

A guy on IRC pointed me to pmap/reduce as a more Clojureish way of doing parallel calculation aggregation. This looks very interesting and is on the list for future projects.
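For reference, a pmap/reduce version might look roughly like the sketch below. It is untested against Yahoo: `fetch` is a hypothetical stand-in that returns canned data, and `pmap-query` is a hypothetical name.

```clojure
(defn fetch
  "Stand-in for the real HTTP request: returns {symbol data}."
  [sym]
  {sym (str "csv-for-" sym)})

(defn pmap-query
  "Fetches every symbol in parallel with pmap and merges the per-symbol maps."
  [symbols]
  ;; pmap runs `fetch` on background threads; reduce forces the lazy
  ;; result and folds the maps together on the calling thread.
  (reduce merge {} (pmap fetch symbols)))

(pmap-query ["AAPL" "GOOG" "IBM"])
;; => {"AAPL" "csv-for-AAPL", "GOOG" "csv-for-GOOG", "IBM" "csv-for-IBM"}
```

One caveat: pmap limits its parallelism to roughly the number of CPU cores plus two, so for I/O-bound work such as HTTP requests it runs far fewer concurrent requests than a 300-thread pool would. Because pmap is built on futures, a standalone script should also call (shutdown-agents) before exiting.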

Profiling

Tests were repeated 10 times, and the runtimes were averaged.

Test               clojure-yahoo-finance   Ruby / yahoofinance-typhoeus
AAPL Jan 2009      241 ms                  146 ms
AAPL all of 2009   609 ms                  267 ms

Clojure-yahoo-finance is quite a bit slower than the Ruby/Typhoeus version, but not so much slower as to render it unusable for most purposes. I suspect this is mainly because Typhoeus uses libcurl, written in C and heavily optimized over many years for speed, for all of its HTTP processing, whereas the Clojure version is using the pure-Java Apache HttpClient, replete with all the usual overengineered cruft one associates with Java, such as 5 million typed helper setter methods for generic “configuration options,” plus a helper class to actually set the 2 configuration options that exist, and Bean objects that do nothing but wrap an integer.
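Averaged timings like the ones in the table could be collected with a small helper along these lines. This is a sketch only: `mean-runtime-ms` is a hypothetical name, it is not necessarily how the numbers above were measured, and it makes no allowance for JVM warm-up.

```clojure
(defn mean-runtime-ms
  "Hypothetical helper: calls the zero-argument function `f` `n` times and
   returns the mean wall-clock runtime in milliseconds."
  [n f]
  (let [timings (doall
                 (repeatedly n
                             (fn []
                               (let [start (System/nanoTime)]
                                 (f)
                                 ;; nanoseconds -> milliseconds
                                 (/ (- (System/nanoTime) start) 1e6)))))]
    (/ (reduce + timings) n)))

;; Example with a dummy workload; in practice `f` would wrap a call such as
;; (blocking-query "2009-01-01" "2009-01-31" ["AAPL"]).
(mean-runtime-ms 10 #(Thread/sleep 5))
```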

The next step is to figure out the Java bindings for libcurl and write a Clojure version with that. Stay tuned.

5 Responses to Yahoo Finance in Clojure

    • Very nice, thanks. I much prefer your version :)
      Apache’s very bloated, so yours is probably faster.

      I suspect libcurl multi-mode would be still faster – it puts all connections in one thread, and so avoids thread creation/teardown overhead – but I haven’t been able to find Java bindings that enable multi-mode yet.. and at this stage it’s not worth the few extra seconds it would yield to write them myself.

      Thanks for sharing the code.

      Cheers,
      Paul

  1. Cool (did not get to respond to ur msg b4 as lost power for 5 days).

    As for Apache, agree with you completely. Most of the Apache software is heavyweight and over-engineered. I try to stay away from anything Apache even in Java-land.

  2. You might be interested in clj-apache-http, which does all the HttpClient nonsense for you.

    Look on GitHub or Clojars.