Yahoo Finance in Ruby with Typhoeus

Yahoo! Finance allows clients to download free historic stock data. Transparentech’s YahooFinance Ruby gem works well, but getting historic data for a large number of symbols is painfully slow. Downloading the end-of-day historic quotes for about 10,000 symbols with the gem was taking 2 hours or more. With Typhoeus and its parallel connections, I can run the same download in under 7 minutes.

I came across Paul Dix’s Typhoeus while Googling today. Not only is it an awesomely fast wrapper for libcurl, it also lets you queue requests and then execute them concurrently.

Update: I’ve just released a Ruby gem based on this code. It’s on Gemcutter, and the source is on GitHub. You can install it with “sudo gem install yahoofinance-typhoeus”.

Why the YahooFinance gem is slow: Net::HTTP, serial connections

YahooFinance uses the Net::HTTP classes internally, which are extremely slow according to published tests (the author of one helpfully summarized his inspection of the code as “Ruby’s Net::HTTP implementation blows”). Those tests mention several alternatives. I’ve settled on Typhoeus for now.

Typhoeus uses libcurl to do the heavy lifting. Libcurl is robust and well-tuned after years of real-world use. While I don’t know if it’s the fastest Ruby HTTP library available, it’s orders of magnitude better than Net::HTTP, and it’s Good Enough™ for now.

Net::HTTP is undoubtedly far and away the largest single performance bottleneck in YahooFinance. There are other potential issues, but they are comparatively minor and could all easily be avoided during the conversion to Typhoeus anyway. For example…

Yahoo Finance (the actual web service, not the gem) apparently won’t return more than 200 days of historic data for some foreign stocks (according to the comments in the gem), so YahooFinance the gem includes chunking logic to retry and aggregate requests for more than that amount of data. This has a minor impact on performance. I won’t ever be requesting more than 200 days in the production system (which downloads new data nightly) anyway, and I have very few foreign stocks. I suppose I will have to revisit this and implement chunking logic when I bootstrap a new server, but for this phase of testing, it is not an issue.
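If chunking ever does become necessary, the date arithmetic is small. The following is only a sketch, not the gem's actual logic: chunk_date_range is a hypothetical helper, and it assumes the 200-day limit can be treated as calendar days (a conservative reading, since a 200-calendar-day window contains fewer than 200 trading days).

```ruby
require 'date'

# Hypothetical helper, not part of the YahooFinance gem: split an inclusive
# date range into consecutive sub-ranges of at most max_days calendar days.
def chunk_date_range(start_date, end_date, max_days = 200)
  chunks = []
  chunk_start = start_date
  while chunk_start <= end_date
    # Each chunk covers max_days days, or whatever is left at the end.
    chunk_end = [chunk_start + (max_days - 1), end_date].min
    chunks << [chunk_start, chunk_end]
    chunk_start = chunk_end + 1
  end
  chunks
end

# One request would be issued per [from, to] pair and the results aggregated.
chunk_date_range(Date.new(2009, 1, 1), Date.new(2009, 12, 31)).each do |from, to|
  puts "#{from} .. #{to}"
end
```

With Typhoeus, each pair could simply become one more queued request, so the chunks for a symbol would download in parallel rather than one after another.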

  1. An entirely new TCP socket to Yahoo was opened and then torn down for each and every symbol, incurring that overhead every time. Libcurl uses persistent connections by default. (I’m not sure whether Typhoeus takes advantage of this functionality – it’s Fast Enough that I haven’t looked into it.) I imagine this is the major issue, since the amount of data transferred per request is not large enough to exceed even the 1k buffer in Ruby 1.8.6, much less the 16k one in 1.8.7.
  2. The chunking logic discussed above.
  3. YahooFinance (the gem) was then creating an entire new HistoricalQuote object for every symbol to pass back to my code. It apparently has an array-based option for better performance, but it was only slightly more trouble to refactor to Typhoeus than to that, so I took the opportunity to ditch Net::HTTP.
  4. Not only does it use Net::HTTP, it uses Net::HTTP::Proxy. This is probably only a marginal slowdown, but on the other hand it sure wasn’t making it any faster.
  5. YahooFinance.gem uses CSV.parse to parse the raw historic data sent back by Yahoo. Probably not all that big of a deal, but again, it was certainly not helping either. Why bother? I just split on \n, loop on that, and split on comma instead. I’m going to loop on it anyway to insert it into the database. Why loop twice?
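The split-based parsing mentioned in the last item can be sketched as below. parse_quotes is a hypothetical helper and the sample rows are made-up values, not real quotes. Plain split is only safe on the assumption that no field contains a quoted comma or an embedded newline, which appears to hold for this numeric data.

```ruby
HEADER = "Date,Open,High,Low,Close,Volume,Adj Close"

# Hypothetical helper: turn a raw CSV body into an array of field arrays,
# skipping the header line and any blank lines.
def parse_quotes(body)
  rows = []
  body.split("\n").each do |line|
    line = line.strip
    next if line.empty? || line == HEADER
    rows << line.split(",")
  end
  rows
end

# Made-up sample data in the same column layout as the header above.
sample = <<CSV
Date,Open,High,Low,Close,Volume,Adj Close
2010-01-05,10.00,10.50,9.90,10.40,1000,10.40
2010-01-04,9.80,10.10,9.70,10.00,1200,10.00
CSV

parse_quotes(sample).each do |date, open, high, low, close, volume, adj_close|
  puts "#{date}: close=#{close} volume=#{volume}"
end
```

In a real loader the body of the loop would insert each row into the database directly, so the data is only walked once.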

Typhoeus code conversion

Typhoeus’ callback model required some minor modifications to my data loader design. My previous system was a rapid prototype designed for quick coding rather than performance (premature optimization being the root of all evil, as the sage said.) In short, I grabbed the list of stock symbols from the database and looped on them, doing a YahooFinance historic data request for each and stuffing the result back into the database. This worked fine for 20 or 30 or 100 symbols, but not so well with 10,000 symbols.

The overall logic with Typhoeus is similar: loop on the stock list, but instead of grabbing the data for each, queue a request with Typhoeus’ hydra dispatcher. Once all the requests have been queued, a blocking call to hydra.run begins the concurrent HTTP downloads.

Typhoeus does not seem to have a non-blocking way to run the HTTP queue. A further optimization would fork off a new thread to do the blocking run call in the background while still processing the database stock list in the main thread, then join the Hydra thread after all requests for all symbols have been queued. However, Ruby uses green threads – might this be an issue for running the C-based libcurl that powers Typhoeus under the hood? In any case, the code is now fast enough that I don’t have to worry about this optimization right now.
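The shape of that background-thread idea can be sketched with plain Ruby threads. FakeDispatcher below is a hypothetical stand-in, not Typhoeus::Hydra: whether Hydra tolerates requests being queued from one thread while run executes in another is not established here, and a real dispatcher might return as soon as its queue is momentarily empty, which the stand-in avoids with an explicit finish marker.

```ruby
require 'thread'

# Stand-in for a blocking dispatcher. This is NOT Typhoeus::Hydra; it only
# mimics "run blocks until everything queued has been processed".
class FakeDispatcher
  DONE = :done

  attr_reader :results

  def initialize
    @queue = Queue.new
    @results = []
  end

  def queue(symbol)
    @queue << symbol
  end

  # Signal that no more work will be queued.
  def finish
    @queue << DONE
  end

  # Blocks until the DONE marker is popped off the queue.
  def run
    while (symbol = @queue.pop) != DONE
      @results << "fetched #{symbol}" # a real dispatcher would do HTTP here
    end
  end
end

dispatcher = FakeDispatcher.new
runner = Thread.new { dispatcher.run } # the blocking call moves to the background

%w[AAPL GOOG INTC].each do |symbol|    # the main thread keeps queuing meanwhile
  dispatcher.queue(symbol)
end

dispatcher.finish
runner.join                            # wait for the background run to drain
puts dispatcher.results
```

Queue is thread-safe, so the main thread and the runner never touch shared state without synchronization; results is only read after join.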

Yahoo Finance with Typhoeus Code

Here is a demo of the Yahoo Finance with Typhoeus code. I haven’t included app-specific parts for getting the stock symbols from my database or storing data from Yahoo Finance into the database. It should be easily adaptable to other applications. I’m thinking of writing a little wrapper around this and publishing it as a Ruby Gem.

Update: Gem now available at Gemcutter, and source on GitHub. You can install it with “gem install yahoofinance-typhoeus”.

#!/usr/bin/env ruby
#
# Load historic data for the given symbols for the last 14 calendar
# days. This will return varying amounts of data depending on how many
# trading days there were within the last 14 calendar days at the time
# you run this.
#
# Copyright (C) 2010 Paul Legato. License to use this code is granted
# under the terms of the GNU Public License version 2.
#

require 'rubygems'
require 'typhoeus'
require 'date'

Symbols = [ 'AAPL', 'GOOG', 'NonexistentSymbol', 'INTC', ]

Today = Date.today
StartDate = Today - 14

hydra = Typhoeus::Hydra.new(:max_concurrency => 20)
hydra.disable_memoization # save memory, since we won't ever use the memo

Symbols.each{|symbol|
  puts "Querying for #{ symbol }..."

  url = "http://itable.finance.yahoo.com" +
  "/table.csv?s=#{ symbol }&g=d" +
  "&a=#{ StartDate.month - 1 }&b=#{ StartDate.mday }&c=#{ StartDate.year }" +
  "&d=#{ Today.month - 1 }&e=#{ Today.mday }&f=#{ Today.year.to_s }"

  request = Typhoeus::Request.new(url, :method => :get)

  request.on_complete {|response|
    if response.code == 200
      if response.body[0..40] != "Date,Open,High,Low,Close,Volume,Adj Close"
        puts " * Error: Unknown response body from Yahoo - #{ response.body[0..40] } ..."
      else
        # good response. go.
        count = 0

        response.body.split("\n").each{|line|
          next if line[0..40] == "Date,Open,High,Low,Close,Volume,Adj Close" # skip header line

          # There's no point splitting it if we're just going to print it
          # back out as CSV, but this is a demo. You would typically be
          # stuffing this into a database here.
          data = line.split(",")
          puts("#{ symbol }," +
               "#{ data[0] }," + # date
               "#{ data[1] }," + # open
               "#{ data[2] }," + # high
               "#{ data[3] }," + # low
               "#{ data[4] }," + # close
               "#{ data[5] }," + # volume
               "#{ data[6] }") # adjusted close

          count += 1
        } # end each line of response

        puts "* #{ symbol } - loaded #{ count } days." if count > 0
      end
    elsif response.code == 404
      puts "#{ symbol } - 404 not found"
    else
      puts " * Error communicating with Yahoo: #{ url } - #{ response.inspect }"
    end

  } # end on_complete callback

  hydra.queue(request)

} # end each symbol

# All desired connections are now queued, so execute them.
puts "Starting hydra..."
hydra.run
puts "Hydra done."
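The URL construction in the script is the piece most worth isolating, because the month parameters are zero-based (hence the month - 1). A minimal sketch, using a hypothetical historic_url helper that only repackages the format from the script above; the parameter meanings in the comments are inferred from that script, not from any Yahoo documentation.

```ruby
require 'date'

# Hypothetical helper wrapping the URL format used in the script above.
# a/b/c = start month (zero-based), day, year; d/e/f = the same for the end date.
def historic_url(symbol, start_date, end_date)
  "http://itable.finance.yahoo.com/table.csv" +
    "?s=#{symbol}&g=d" +
    "&a=#{start_date.month - 1}&b=#{start_date.mday}&c=#{start_date.year}" +
    "&d=#{end_date.month - 1}&e=#{end_date.mday}&f=#{end_date.year}"
end

puts historic_url("AAPL", Date.new(2010, 1, 4), Date.new(2010, 1, 18))
```

Keeping this in a pure function makes the off-by-one month handling easy to check without touching the network.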

YahooFinance-Typhoeus gem demo

With the gem, the demo is dead simple. First, gem install yahoofinance-typhoeus. Then just:

 require 'rubygems'
 require 'yahoofinance-typhoeus'

 yf = YahooFinance.new # or YahooFinance.new(max_number_of_concurrent_connections)

 yf.add_query("AAPL", "2008-12-01", "2008-12-15") {|response| puts response}
 yf.add_query("IBM", "2008-10-15", "2008-10-30") {|response| puts response}

 yf.run

For a quick, one-off query, it’s even simpler (although the quick_query method does not make use of Typhoeus’s parallelism):

 require 'rubygems'
 require 'yahoofinance-typhoeus'

 YahooFinance.quick_query("AAPL", "2009-01-01", "2009-02-01")

You can even do it from the command line and then pipe the output to awk or less or whatever. (Note that you can’t use -r to load rubygems due to a bug/feature in the Ruby interpreter).

ruby -rubygems -e "require 'yahoofinance-typhoeus' ; puts YahooFinance.quick_query('AAPL', '2009-01-01', '2009-02-01')" |less


11 Responses to Yahoo Finance in Ruby with Typhoeus

  1. Hi, I ran across your ruby IB TWS interface today. Is this still an active project?

    Getting data through IB may solve your issues in that IB offers a much more efficient API (in theory). In addition to trading through IB, am thinking to use as a data source in preference to yahoo.

    I may be able to contribute some time around testing it if it is still an ongoing project …

    • Hi Jonathan,

      You’re right, the IB API would probably be a lot more efficient insofar as it keeps just one socket open for everything. The problem I ran into there is TWS: you have to have the TWS app open and logged into IB to be able to use the API and get historic data. TWS is designed (for some reason that is beyond me – maybe to fix reference leaks?) to shut itself down automatically once every day. As far as I know there is no way to get it to automatically log itself back in, so that means the user must manually re-authenticate at least once every 24 hours.

      I have my analysis system running on a server, and I want it to just do its thing without any manual intervention required, so I’m using authenticationless Yahoo EOD for it right now. If and when I move to EOD data, I’ll probably revisit the setup and move back to IB.

      I’ve shifted the focus of my coding more to Clojure lately, so I’m not doing too much with IB-Ruby these days. Ruby is a lot easier to get prototypes up and running than Clojure, so I do still use it sometimes. All the code is on Github and I’ll gladly merge any patches back in. I haven’t tested anything other than quotes and historic data downloading extensively, so there’s a lot that could be done as far as an automated test suite and example code. Let’s talk on IM sometime and see what we can come up with.

      Best,
      Paul

  2. Pingback: Yahoo Finance in Clojure | Paul Legato

  3. Hi Paul,

    This gem sounds awesome! I really want to use it, but I’m coding in windows and I’m going through hell trying to install libcurl… Any clue on how to do that?

    Thanks a lot,

    Tim

    • Hi Tim,

      Cygwin generally works well for using Unix stuff on Windows. I don’t have a Windows machine, so I’m not sure what exactly you have to do. It looks like there’s a Windows installer at http://curl.haxx.se/download.html , too. Let us know how it goes.

      Cheers,
      Paul

  4. I’m getting a SymbolNotFoundException for some of the date ranges I’m using.

    How can I trap the exception to allow the other queries to continue. Some stocks don’t seem to have data for certain date ranges.

    Thanks

    • Ty,

      In general, wrap the offending query in a begin...rescue...end block. You probably want to print or log some kind of error message for inspection.

      Can you post some more information, such as the symbol and date range that cause the error, and the stack trace?

      Paul

  5. YahooFinance.quick_query("FCPT.L", "2010-02-26", "2010-03-26")
    SymbolNotFoundException: FCPT.L not found at Yahoo
    from /Library/Ruby/Gems/1.8/gems/yahoofinance-typhoeus-1.0.0/lib/yahoofinance-typhoeus.rb:108:in `make_request'
    from /Library/Ruby/Gems/1.8/gems/typhoeus-0.2.4/lib/typhoeus/request.rb:135:in `call'
    from /Library/Ruby/Gems/1.8/gems/typhoeus-0.2.4/lib/typhoeus/request.rb:135:in `call_handlers'
    from /Library/Ruby/Gems/1.8/gems/typhoeus-0.2.4/lib/typhoeus/hydra.rb:218:in `handle_request'
    from /Library/Ruby/Gems/1.8/gems/typhoeus-0.2.4/lib/typhoeus/hydra.rb:187:in `get_easy_object'
    from /Library/Ruby/Gems/1.8/gems/typhoeus-0.2.4/lib/typhoeus/easy.rb:332:in `call'
    from /Library/Ruby/Gems/1.8/gems/typhoeus-0.2.4/lib/typhoeus/easy.rb:332:in `failure'
    from /Library/Ruby/Gems/1.8/gems/typhoeus-0.2.4/lib/typhoeus/multi.rb:21:in `multi_perform'
    from /Library/Ruby/Gems/1.8/gems/typhoeus-0.2.4/lib/typhoeus/multi.rb:21:in `perform'
    from /Library/Ruby/Gems/1.8/gems/typhoeus-0.2.4/lib/typhoeus/hydra.rb:95:in `run'
    from /Library/Ruby/Gems/1.8/gems/yahoofinance-typhoeus-1.0.0/lib/yahoofinance-typhoeus.rb:40:in `run'
    from /Library/Ruby/Gems/1.8/gems/yahoofinance-typhoeus-1.0.0/lib/yahoofinance-typhoeus.rb:82:in `quick_query'
    from (irb):3

  6. Sorry, forgot to add within my code i’m actually using add_query. Trapping at yp.run gives me the same information but it still terminates all the other add_query’s from executing.

    A retry in the rescue block didn’t help.

    Note: not a big issue, figured if there was a quick and simple solution I’ll do that instead of modifying my input file.

    Thanks
