Scaling Yahoo! Pipes

Yahoo! Pipes are Clogged

If you’ve been trying to access Yahoo! Pipes and come up with a request for Mario, then have some sympathy for Yahoo!, as Dare Obasanjo comments:

As someone who works on large scale online services for a living, Yahoo! Pipes seems like a scary proposition.

Basically the service is likely to have hit the twin sisters of scaling large systems: CPU and I/O bounds. The nature and flexibility that Yahoo! Pipes provides, user-defined queries over changing data streams that are activated every time a pipe is “pulled”, will create a heavy load. Dare continues:

It combines providing a service that is known for causing scale issues due to heavy I/O requirements (i.e. serving RSS feeds) with one that is known for scaling issues due to heavy CPU and I/O requirements (i.e. user-defined queries over rapidly changing data). I suspect that this combination of features makes Yahoo! Pipes resistant to popular caching techniques especially if the screenshot below is any indication of the amount of flexibility [and thus processing power required] that is given to users in creating queries.

This has always been an issue if you consider a centralised event routing infrastructure. I suspect it is made worse by the “pull” nature of feeds. Even if If-Modified-Since headers or ETags are used in the HTTP request, determining whether a user-defined feed has changed would require processing the whole chain, effectively adding a whole lot of processing overhead. This is, however, a naive way of approaching the problem. An internal event architecture and caching are applicable; the caching is just different to the way content caching currently works. However, the answer, I suspect, is not to centralise. The general infrastructure represented by Yahoo! Pipes should be deployed by millions. I have my own router for IP traffic, so why not have my own router for application data traffic?
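To make the caching point concrete, here is a minimal sketch in Python. It is not how Yahoo! Pipes works internally, which I have no knowledge of; the `Source` and `Pipe` classes, the `publish` method, and the version counter standing in for an ETag are all hypothetical names invented for illustration. The idea is that a pipe's validator can be derived from the validators of its upstream feeds, so a conditional request can be answered without re-running the user-defined query.

```python
import hashlib
from typing import Callable, List, Optional, Tuple

Items = List[str]


class Source:
    """Hypothetical upstream feed with a version counter standing in for an ETag."""

    def __init__(self, items: Items) -> None:
        self.items = list(items)
        self.version = 0

    def publish(self, item: str) -> None:
        # An internal "event": new content arrives and the validator changes.
        self.items.append(item)
        self.version += 1

    def etag(self) -> str:
        return str(self.version)


class Pipe:
    """Hypothetical user-defined query over several sources."""

    def __init__(self, sources: List[Source], query: Callable[[Items], Items]) -> None:
        self.sources = sources
        self.query = query
        self.runs = 0  # how many times the (expensive) query actually ran
        self._cached_etag: Optional[str] = None
        self._cached_items: Items = []

    def etag(self) -> str:
        # Derive the pipe's validator from upstream validators only;
        # no query processing is needed to compute it.
        joined = "|".join(s.etag() for s in self.sources)
        return hashlib.sha1(joined.encode("utf-8")).hexdigest()

    def get(self, if_none_match: Optional[str] = None) -> Tuple[int, str, Items]:
        current = self.etag()
        if if_none_match == current:
            return 304, current, []  # nothing upstream changed
        if self._cached_etag != current:
            # Only now pay for running the whole chain.
            merged = [item for s in self.sources for item in s.items]
            self._cached_items = self.query(merged)
            self._cached_etag = current
            self.runs += 1
        return 200, current, list(self._cached_items)


news = Source(["pipes launch", "weather"])
blogs = Source(["scaling pipes"])
pipe = Pipe([news, blogs], lambda items: [i for i in items if "pipes" in i])

status, tag, items = pipe.get()        # first pull runs the query
status_again, _, _ = pipe.get(tag)     # conditional pull, query not re-run
```

The sketch assumes the sources live in the same process and announce their own changes, which is the internal event architecture in miniature. With remote feeds the upstream validators would themselves have to be fetched or pushed, and that is where running your own router starts to look attractive.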
