<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Paul Legato &#187; ruby</title>
	<atom:link href="http://www.paullegato.com/blog/tag/ruby/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.paullegato.com</link>
	<description></description>
	<lastBuildDate>Tue, 06 Dec 2011 00:52:36 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Yahoo Finance in Ruby with Typhoeus</title>
		<link>http://www.paullegato.com/blog/yahoo-finance-ruby-typhoeus/</link>
		<comments>http://www.paullegato.com/blog/yahoo-finance-ruby-typhoeus/#comments</comments>
		<pubDate>Fri, 19 Feb 2010 01:25:54 +0000</pubDate>
		<dc:creator>Paul Legato</dc:creator>
				<category><![CDATA[Tech]]></category>
		<category><![CDATA[automated trading]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[ruby]]></category>

		<guid isPermaLink="false">http://www.paullegato.com/?p=83</guid>
		<description><![CDATA[A fast and efficient way to download historic stock data from Yahoo! Finance with Ruby.]]></description>
			<content:encoded><![CDATA[<p>Yahoo! Finance allows clients to download free historic stock data. Transparentech&#8217;s <a target="_blank" href="http://www.transparentech.com/opensource/yahoofinance" >YahooFinance Ruby gem</a> works well, but getting historic data for a large number of symbols is painfully slow. With the gem, downloading the end-of-day historic quotes for about 10,000 symbols was taking 2 hours or more. With Typhoeus and its parallel connections, I can run the same download in under 7 minutes.</p>
<p>I came across <a target="_blank" href="http://github.com/pauldix/typhoeus" >Paul Dix&#8217;s Typhoeus</a> while Googling today. Not only is it an awesomely fast wrapper for <a target="_blank" href="http://curl.haxx.se/" >libcurl</a>, it also allows you to queue requests and then execute them concurrently.</p>
<p><strong>Update:</strong> I&#8217;ve just released a Ruby gem based on this code. It&#8217;s on <a target="_blank" href="http://gemcutter.org/gems/yahoofinance-typhoeus" >Gemcutter</a>, and <a target="_blank" href="http://github.com/pjlegato/yahoofinance-typhoeus" >the source is on GitHub</a>. You can install it with &#8220;<code>sudo gem install yahoofinance-typhoeus</code>&#8221;.<br />
<span id="more-83"></span></p>
<h2>Why the YahooFinance gem is slow: Net::HTTP, serial connections</h2>
<p>YahooFinance uses the Net::HTTP classes internally, which are <a target="_blank" href="http://apocryph.org/2008/10/04/analysis_ruby_18x_http_client_performance/" >extremely</a> <a target="_blank" href="http://www.pauldix.net/2009/01/ruby-http-client-library-performance.html" >slow</a> (the author of the former helpfully summarizes his inspection of the code as &#8220;Ruby’s Net::HTTP implementation blows&#8221;). There are several alternatives mentioned in those tests. I&#8217;ve settled on Typhoeus for now.</p>
<p>Typhoeus uses libcurl to do the heavy lifting. Libcurl is robust and well-tuned after years of real-world use. While I don&#8217;t know if it&#8217;s the fastest Ruby HTTP library available, it&#8217;s orders of magnitude better than Net::HTTP, and it&#8217;s Good Enough&trade; for now.</p>
<p>Net::HTTP is undoubtedly far and away the largest single performance bottleneck in YahooFinance. There are other potential issues, but they are comparatively minor and could all easily be avoided during the conversion to Typhoeus anyway. For example&#8230;</p>
<p>Yahoo Finance (the actual web service, not the gem) apparently won&#8217;t return more than 200 days of historic data for some foreign stocks (according to the comments in the gem), so YahooFinance the gem includes chunking logic to retry and aggregate requests for more than that amount of data. This has a minor impact on performance. In the production system (which downloads new data nightly) I won&#8217;t ever be requesting more than 200 days anyway, and I have very few foreign stocks. I suppose I will have to revisit this and implement chunking logic when I bootstrap a new server, but for this phase of testing it is not an issue.</p>
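<p>For reference, here is a minimal sketch of what such chunking could look like. It is not the gem&#8217;s actual code: <code>chunk_date_range</code> is a hypothetical helper, and it assumes the 200-day limit can be treated conservatively as 200 calendar days. It only splits the date range; issuing one request per chunk and concatenating the results is left out.</p>

```ruby
require 'date'

# Hypothetical helper (not part of either gem): split an inclusive date
# range into consecutive sub-ranges of at most max_days calendar days each.
# Returns an array of [chunk_start, chunk_end] pairs.
def chunk_date_range(start_date, end_date, max_days = 200)
  chunks = []
  chunk_start = start_date

  until chunk_start > end_date
    # The last chunk is clipped so that it never runs past end_date.
    chunk_end = [chunk_start + (max_days - 1), end_date].min
    chunks.push([chunk_start, chunk_end])
    chunk_start = chunk_end + 1
  end

  chunks
end
```

<p>Calling <code>chunk_date_range(Date.new(2009, 1, 1), Date.new(2009, 12, 31))</code> yields two back-to-back ranges, each of which could then be queued as its own request.</p>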
<ol>
<li>An entirely new TCP socket to Yahoo was opened and then torn down for each and every symbol, incurring that overhead every time. Libcurl uses persistent connections by default. (I&#8217;m not sure whether Typhoeus takes advantage of this functionality &#8211; it&#8217;s Fast Enough that I haven&#8217;t looked into it.) I imagine this is the major issue, since the amount of data transferred per request is not large enough to exceed even the 1k buffer in Ruby 1.8.6, much less the 16k one in 1.8.7.</li>
<li>The chunking logic discussed above.</li>
<li>YahooFinance (the gem) was then creating an entire new HistoricalQuote object for every symbol to pass back to my code. It apparently has an array-based option for better performance, but it was only slightly more trouble to refactor to Typhoeus than to that, so I took the opportunity to ditch Net::HTTP.</li>
<li>Not only does it use Net::HTTP, it uses Net::HTTP::Proxy. This is probably only a marginal slowdown, but on the other hand it sure wasn&#8217;t making it any faster.</li>
<li>YahooFinance.gem uses <code>CSV.parse</code> to parse the raw historic data sent back by Yahoo. Probably not all that big of a deal, but again, it was certainly not helping either. Why bother? I just split on \n, loop on that, and split on comma instead. I&#8217;m going to loop on it anyway to insert it into the database. Why loop twice?</li>
</ol>
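<p>The split-based parsing from the last item can be sketched roughly as follows. <code>parse_quotes</code> and <code>YAHOO_CSV_HEADER</code> are names made up for this illustration, and the approach only holds up because the fields in this format never contain quoted commas; for general CSV input, <code>CSV.parse</code> remains the safer choice.</p>

```ruby
# Header row that Yahoo puts at the top of each historic data response.
YAHOO_CSV_HEADER = "Date,Open,High,Low,Close,Volume,Adj Close"

# Hypothetical helper: turn a Yahoo-style CSV body into an array of hashes
# using plain String#split rather than the CSV library.
def parse_quotes(body)
  quotes = []

  body.split("\n").each do |line|
    line = line.strip
    next if line.empty? || line == YAHOO_CSV_HEADER # skip blanks and header

    date, open_price, high, low, close, volume, adj_close = line.split(",")
    quotes.push({
      date:      date,
      open:      open_price.to_f,
      high:      high.to_f,
      low:       low.to_f,
      close:     close.to_f,
      volume:    volume.to_i,
      adj_close: adj_close.to_f
    })
  end

  quotes
end
```

<p>In a real loader the body of the loop would insert straight into the database rather than build an array, which is what makes the single pass worthwhile.</p>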
<h2>Typhoeus code conversion</h2>
<p>Typhoeus&#8217; callback model required some minor modifications to my data loader design. My previous system was a rapid prototype designed for quick coding rather than performance (premature optimization being the root of all evil, as the sage said.) In short, I grabbed the list of stock symbols from the database and looped on them, doing a YahooFinance historic data request for each and stuffing the result back into the database. This worked fine for 20 or 30 or 100 symbols, but not so well with 10,000 symbols.</p>
<p>The overall logic with Typhoeus is similar: loop on the stock list, but instead of grabbing the data for each, queue a request with Typhoeus&#8217; <code>hydra</code> dispatcher. Once all the requests have been queued, a blocking call to <code>hydra.run</code> begins the concurrent HTTP downloads. </p>
<p>Typhoeus does not seem to have a non-blocking way to run the HTTP queue. A further optimization would fork off a new thread to do the blocking <code>run</code> call in the background while still processing the database stock list in the main thread, then join the Hydra thread after all requests for all symbols have been queued. However, Ruby 1.8 uses green threads &#8211; might this be an issue for running the C-based libcurl that powers Typhoeus under the hood? In any case, the code is now fast enough that I don&#8217;t have to worry about this optimization right now.</p>
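<p>The general shape of that background-thread idea, stripped of Typhoeus entirely, looks something like the sketch below. Everything in it is a stand-in: <code>fetch</code> and <code>load_in_background</code> are hypothetical names, and the worker pops from a standard-library <code>Queue</code> instead of calling <code>hydra.run</code>. It says nothing about whether Hydra itself tolerates having requests queued from another thread while it is running.</p>

```ruby
require 'thread'

# Hypothetical stand-in for real work; a real loader would issue an HTTP
# request here instead of building a string.
def fetch(symbol)
  "data for #{symbol.upcase}"
end

# Process symbols on a background thread while the caller keeps queuing.
def load_in_background(symbols)
  pending = Queue.new
  results = []

  # The worker blocks on pending.pop until work (or the sentinel) arrives.
  worker = Thread.new do
    while (symbol = pending.pop) != :done
      results.push(fetch(symbol))
    end
  end

  # Meanwhile, the main thread keeps producing work.
  symbols.each { |symbol| pending.push(symbol) }
  pending.push(:done) # sentinel: nothing more to queue

  worker.join # wait for the background thread to drain the queue
  results
end
```

<p>With a single worker the results come back in the order the symbols were queued; the main thread only reads <code>results</code> after the join, so no extra locking is needed here.</p>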
<h2>Yahoo Finance with Typhoeus Code</h2>
<p>Here is a demo of the Yahoo Finance with Typhoeus code. I haven&#8217;t included app-specific parts for getting the stock symbols from my database or storing data from Yahoo Finance into the database. It should be easily adaptable to other applications. I&#8217;m thinking of writing a little wrapper around this and publishing it as a Ruby Gem.</p>
<p><strong>Update:</strong> Gem now available at <a target="_blank" href="http://gemcutter.org/gems/yahoofinance-typhoeus" >Gemcutter</a>, and source <a target="_blank" href="http://github.com/pjlegato/yahoofinance-typhoeus" >on GitHub</a>. You can install it with &#8220;<code>gem install yahoofinance-typhoeus</code>&#8221;.</p>
<pre class="brush: ruby; title: ; notranslate">
#!/usr/bin/env ruby
#
# Load historic data for the given symbols for the last 14 calendar
# days. This will return varying amounts of data depending on how many
# trading days there were within the last 14 calendar days at the time
# you run this.
#
# Copyright (C) 2010 Paul Legato. License to use this code is granted
# under the terms of the GNU Public License version 2.
#

require 'rubygems'
require 'typhoeus'
require 'date'

Symbols = [ 'AAPL', 'GOOG', 'NonexistentSymbol', 'INTC', ]

Today = Date.today
StartDate = Today - 14

hydra = Typhoeus::Hydra.new(:max_concurrency =&gt; 20)
hydra.disable_memoization # save memory, since we won't ever use the memo

Symbols.each{|symbol|
  puts &quot;Querying for #{ symbol }...&quot;

  url = &quot;http://itable.finance.yahoo.com&quot; +
  &quot;/table.csv?s=#{ symbol }&amp;g=d&quot; +
  &quot;&amp;a=#{ StartDate.month - 1 }&amp;b=#{ StartDate.mday }&amp;c=#{ StartDate.year }&quot; +
  &quot;&amp;d=#{ Today.month - 1 }&amp;e=#{ Today.mday }&amp;f=#{ Today.year.to_s }&quot;

  request = Typhoeus::Request.new(url, :method =&gt; :get)

  request.on_complete {|response|
    if response.code == 200
      if response.body[0..40] != &quot;Date,Open,High,Low,Close,Volume,Adj Close&quot;
        puts &quot; * Error: Unknown response body from Yahoo - #{ response.body[0..40] } ...&quot;
      else
        # good response. go.
        count = 0

        response.body.split(&quot;\n&quot;).each{|line|
          next if line[0..40] == &quot;Date,Open,High,Low,Close,Volume,Adj Close&quot; # skip header line

          # There's no point splitting it if we're just going to print it
          # back out as CSV, but this is a demo. You would typically be
          # stuffing this into a database here.
          data = line.split(&quot;,&quot;)
          puts(&quot;#{ symbol },&quot; +
               &quot;#{ data[0] },&quot; + # date
               &quot;#{ data[1] },&quot; + # open
               &quot;#{ data[2] },&quot; + # high
               &quot;#{ data[3] },&quot; + # low
               &quot;#{ data[4] },&quot; + # close
               &quot;#{ data[5] },&quot; + # volume
               &quot;#{ data[6] }&quot;) # adjusted close

          count += 1
        } # end each line of response

        puts &quot;* #{ symbol } - loaded #{ count } days.&quot; if count &gt; 0
      end
    elsif response.code == 404
      puts &quot;#{ symbol } - 404 not found&quot;
    else
      puts &quot; * Error communicating with Yahoo: #{ url } - #{ response.inspect }&quot;
    end

  } # end on_complete callback

  hydra.queue(request)

} # end each symbol

# All desired connections are now queued, so execute them.
puts &quot;Starting hydra...&quot;
hydra.run
puts &quot;Hydra done.&quot;
</pre>
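<p>The least obvious part of the script is the URL, so here it is pulled out into a small helper. <code>yahoo_history_url</code> is a name invented for this sketch; the host and the parameter letters are copied from the script above, where <code>a</code>/<code>b</code>/<code>c</code> carry the start month, day and year and <code>d</code>/<code>e</code>/<code>f</code> the end month, day and year, with months counted from zero. The symbol is interpolated as-is, on the assumption that it needs no URL escaping.</p>

```ruby
require 'date'

# Hypothetical helper extracted from the script above: build the daily
# historic data URL for one symbol and an inclusive date range.
def yahoo_history_url(symbol, start_date, end_date)
  "http://itable.finance.yahoo.com/table.csv?s=#{symbol}&g=d" +
    # Months are zero-based in this query string, hence the "- 1".
    "&a=#{start_date.month - 1}&b=#{start_date.mday}&c=#{start_date.year}" +
    "&d=#{end_date.month - 1}&e=#{end_date.mday}&f=#{end_date.year}"
end
```

<p>Keeping this in one place means the zero-based month adjustment only has to be gotten right once.</p>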
<h2>YahooFinance-Typhoeus gem demo</h2>
<p>With the gem, the demo is dead simple. First, <code>gem install yahoofinance-typhoeus</code>. Then just:</p>
<pre class="brush: ruby; title: ; notranslate">
 require 'rubygems'
 require 'yahoofinance-typhoeus'

 yf = YahooFinance.new # or YahooFinance.new(max_number_of_concurrent_connections)

 yf.add_query(&quot;AAPL&quot;, &quot;2008-12-01&quot;, &quot;2008-12-15&quot;) {|response| puts response}
 yf.add_query(&quot;IBM&quot;, &quot;2008-10-15&quot;, &quot;2008-10-30&quot;) {|response| puts response}

 yf.run
</pre>
<p>For a quick, one-off query, it&#8217;s even simpler (although the <code>quick_query</code> method does not make use of Typhoeus&#8217;s parallelism):</p>
<pre class="brush: ruby; title: ; notranslate">
require 'rubygems'
require 'yahoofinance-typhoeus'

 YahooFinance.quick_query(&quot;AAPL&quot;, &quot;2009-01-01&quot;, &quot;2009-02-01&quot;)
</pre>
<p>You can even do it from the command line and then pipe the output to awk or less or whatever. (Note that you can&#8217;t use -r to load rubygems due to a bug/feature in the Ruby interpreter).</p>
<pre class="brush: bash; title: ; notranslate">
ruby -rubygems -e &quot;require 'yahoofinance-typhoeus' ; puts YahooFinance.quick_query('AAPL', '2009-01-01', '2009-02-01')&quot; |less
</pre>
]]></content:encoded>
			<wfw:commentRss>http://www.paullegato.com/blog/yahoo-finance-ruby-typhoeus/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
	</channel>
</rss>

