Archive for the 'Concurrency' Category

Concurrency: You’ll have to understand it one day soon.

It is pretty clear that scale is an important topic these days: scaling cheaply, reliably and seamlessly to boot. I noticed several recent talks and tutorials on scale and concurrency at OSCON 2007, and the Web 2.0 Expo and ETech07 had their share too. Numerous mentions of LiveJournal’s and MySpace’s approaches to scaling out have also been recirculating recently. These approaches are considered the current blueprint for handling scale in LAMP-based web applications, I suspect mostly because they suggest you should forget about scaling and just let your architecture evolve along the prescribed lines; in other words, scale is a nice problem to have. In a pragmatic way I totally agree with that.

The general approach at the application layer can be summed up as: partition your database around some key object in the system, typically a user. Replicate these partitions across your various database servers. Deploy your front end on a web farm and potentially factor out some application functionality to dedicated servers. Manage sessions separately, either by creating a discrete service within the architecture or by resolving sessions to a particular database partition; basically, push state down to a data store where the application coordination happens. Then cache whatever you can. That’s it. Well, the devil is in the details.
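The partitioning step above usually boils down to a stable mapping from the key object to a shard. A minimal sketch in Python (the shard names are purely illustrative, and real deployments add replication and rebalancing on top):

```python
import hashlib

# Hypothetical shard list; in practice this comes from configuration.
SHARDS = ["db1.example.com", "db2.example.com",
          "db3.example.com", "db4.example.com"]

def shard_for_user(user_id: str) -> str:
    """Map a user to a shard with a stable hash, so all of that
    user's data (and their session, if resolved this way) lives
    on one partition."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for_user("alice") == shard_for_user("alice"))  # True: stable
```

Because the mapping is deterministic, any front-end box in the web farm can route a request to the right partition without shared state.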

Don MacAskill’s presentation at ETech07 details SmugMug’s architecture using Amazon’s S3 and experiments with EC2 and SQS. It contains a great deal of information about using Amazon’s Web Services. The utility model, or compute power on demand, of deploying services will become very prevalent over the next few years. While Amazon are the first movers, others will follow with potentially more powerful computing models.

All good. Basically these are all large-grained approaches to leveraging concurrent execution using lots of boxes. Application design is just chopped up into a few tiers. It is easy to understand, and at this point we can happily forget about concurrency because each component in the system is a big black box with synchronous calls between them.

Several things are happening that suggest our sequential, synchronous existence may be over: power consumption, multi-core CPUs, on-demand scaling as exemplified by Amazon’s EC2 service, transaction reliability and the rise of data mining. All make confronting scaling issues early on important for your application architecture.

The situation for CPU design, and the implications for software design, was clearly discussed back in 2005 in a Dr. Dobb’s article by Herb Sutter, The Free Lunch is Over: A Fundamental Turn Towards Concurrency in Software. In the article Herb discusses the limitations of chip design and why this will lead to a focus on multi-core designs rather than increasing clock speeds on a single core. A key barrier to increasing clock speed is power consumption. Heat is a by-product of power consumption, so power consumption isn’t related just to the operation of a machine but also to the cooling required to keep it operating. Power consumption is a major issue for Google and any company that runs data centers. Even my little server room at home should have air conditioning during the summer months! Server power consumption is one of the motivations given for O’Reilly Media’s conference on energy innovation later this year.

While two- and four-core CPU architectures are now mainstream, we are seeing an increase in the number of cores. An example demonstrated recently is Intel’s 80-core research chip, which crunches 1 Tflop at 3.2 GHz.

“Intel’s 80-core chip is basically a mainframe-on-a-chip-literally,” said McGregor. “It’s the equivalent of 80 blade processors plugged into a high-speed backplane, or 80 separate computers using a high-speed hardware interconnect.”

I suspect that, given Moore’s Law, we can expect some associated law regarding the number of cores on a square inch of wafer increasing exponentially over the coming years. No doubt there will be some correlation to Peak Oil issues and increasing awareness of Global Warming as the market impetus for more cores. If approaches to concurrency in software are any indication, I can easily imagine clock speeds decreasing, with chip designers focusing on throughput rather than outright performance. This would suggest the rise of commercial asynchronous chip designs. [Maybe they are already here; I know nothing about current chip designs beyond their multi-coreness.] It would also fit with Clayton Christensen’s “law of conservation of modularity“.

Current software development techniques can only hide so much. Eventually we will need to utilise the multiple cores to maximise the benefit they provide. Multi-threading is notoriously difficult to do correctly, so I don’t see that as a solution. Asynchronous event- or message-passing approaches provide a better way: basically they model how hardware works, and micro-kernel operating systems have been modelled in a similar fashion.
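The appeal of the message-passing style can be sketched in a few lines of Python. Here a worker owns a private mailbox and reacts to one message at a time, so no shared state needs locking (the doubling is just a stand-in for real work):

```python
import queue
import threading

def worker(mailbox: queue.Queue, results: list) -> None:
    # The worker only ever touches its own state; all coordination
    # happens through messages arriving in the mailbox.
    while True:
        msg = mailbox.get()
        if msg is None:          # sentinel message: shut down
            break
        results.append(msg * 2)  # stand-in for real work

mailbox, results = queue.Queue(), []
t = threading.Thread(target=worker, args=(mailbox, results))
t.start()
for n in range(3):
    mailbox.put(n)               # fire-and-forget sends
mailbox.put(None)
t.join()
print(results)  # [0, 2, 4]
```

The same shape scales from threads on one box to processes on many boxes: only the transport under the mailbox changes.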

We have been doing this application composition stuff with messages for a while now (albeit synchronously) with Unix pipes and the Web, but these architectures are only just beginning to be appreciated beyond the administrative, script-based coordination Unix pipes have traditionally been used for.

Google have raised awareness and appreciation by publishing papers on internal toolkits like MapReduce, Chubby and BigTable.
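The idea behind MapReduce fits in a few lines; this is of course not Google’s implementation, just the programming model in miniature, using the classic word-count example:

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    # Emit a (key, value) pair per word; map calls are independent,
    # so they can run on as many machines as you have documents.
    for word in doc.split():
        yield (word, 1)

def reduce_phase(pairs):
    # Group by key and combine the values; reducers for different
    # keys are likewise independent.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["the cat", "the dog"]
pairs = chain.from_iterable(map_phase(d) for d in docs)
print(reduce_phase(pairs))  # {'the': 2, 'cat': 1, 'dog': 1}
```

The framework’s real contribution is everything this sketch omits: distributing the map and reduce calls across machines and surviving their failures.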

One thing that surprises me about various presentations on scaling web applications is the lack of mention of message passing, specifically message queues or event buses. These are the staple tools when you want to scale and ensure data doesn’t get lost. Then again, many of the consumer-oriented services don’t need such reliability. The cost of losing a customer’s IM message, comment post or uploaded image is minimal: an annoyance at worst. But as web applications evolve into business applications, losing data due to scale demands has real costs.
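The reason a queue helps is that it decouples accepting a write from processing it: the front end enqueues and returns immediately, and a consumer drains the queue at its own pace, so a burst of traffic becomes a backlog rather than lost data. A minimal in-process sketch (a real deployment would use a durable broker, not an in-memory queue):

```python
import queue

inbox = queue.Queue()

def accept_comment(user, text):
    # Fast path for the web tier: enqueue and return immediately.
    inbox.put({"user": user, "text": text})

def drain(store):
    # The consumer persists messages at its own pace.
    while not inbox.empty():
        msg = inbox.get()
        store.append(msg)   # stand-in for a durable database write
        inbox.task_done()

store = []
accept_comment("ann", "first!")
accept_comment("bob", "nice post")
drain(store)
print(len(store))  # 2
```

Swap the in-memory `queue.Queue` for a persistent queue and the same shape survives consumer crashes, which is exactly the reliability property business applications need.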

No comments

Is Erlang the language of choice?

Commenting on the recent (early) announcement of the new Pragmatic Bookshelf book Programming Erlang, Tim asks if Erlang will be the “language of choice for concurrent programming.” I’ve had this discussion with people over the few years that I have been dabbling in Erlang.

Firstly there is the question of Erlang adoption, regardless of Erlang’s suitability for concurrent programming. Two barriers stand in the way of Erlang adoption for many people, and they happen to be Erlang’s two strengths: concurrency and functional programming.

Functional programming is not culturally accepted or understood very widely; it is still considered rather academic. Real effort is required to learn a new programming paradigm, so you do it out of interest. It is great that languages like Ruby, Python and XSLT, along with the rise in love for Javascript (Javascript is a functional programming language at its core), all help raise the level of understanding of expressive power in languages. These languages all make functional primitives available, which is good. However, unless you have read Abelson or similar, or done a course in functional languages, the concepts are just too foreign to get to grips with when directly confronted with them. We don’t think functionally, even though it is not that different from an object-oriented perspective. Maybe that illustrates my point: most OO languages are used in an object-based fashion. It is just so hard to shake our procedural world view.
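Those functional primitives are already in the mainstream languages mentioned above. A small Python illustration: first-class functions, map, filter and a fold, with no loop variables mutated anywhere:

```python
from functools import reduce

nums = [1, 2, 3, 4, 5]

# Functions as values, applied over data rather than written as loops.
evens = list(filter(lambda n: n % 2 == 0, nums))
squares = list(map(lambda n: n * n, nums))
total = reduce(lambda acc, n: acc + n, nums, 0)

print(evens, squares, total)  # [2, 4] [1, 4, 9, 16, 25] 15
```

Anyone comfortable with this style has already absorbed a fair chunk of what Erlang asks of them on the functional side.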

I’m sure there is a relationship between this and the market’s acceptance of Lotus Improv (now Quantrix: think OLAP spreadsheet, and really good for financial modelling); in Excel you could be ad hoc and evolve what you did, while Improv forced you to think first. Which reminds me that millions of people program in a functional manner; they just don’t know it. Anyway, it doesn’t help that a lot of writing about functional programming concepts is just badly explained.

The second difficulty with Erlang is concurrency. Again, the concepts behind concurrent, asynchronous message-passing software development are just unnatural. We solve and think of problems synchronously. Weird, since we live and operate in an asynchronous world. The odd thing is that in object-oriented systems “messages” are passed between objects, and component development is not that different either. In Erlang, processes are objects and messages are passed between them.
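That “processes are objects” view can be made concrete even outside Erlang. Here is a rough Python analogue of an Erlang process: private state, a mailbox, and behaviour selected by matching on the message, so sending a message plays the role of a method call:

```python
import queue
import threading

class CounterProcess:
    """A rough analogue of an Erlang process: private state plus a
    mailbox, with a receive loop dispatching on the message."""
    def __init__(self):
        self.mailbox = queue.Queue()
        self.count = 0  # private: only the receive loop touches it
        threading.Thread(target=self._loop, daemon=True).start()

    def _loop(self):
        while True:
            msg, reply = self.mailbox.get()
            if msg == "incr":
                self.count += 1
            elif msg == "get":
                reply.put(self.count)
            elif msg == "stop":
                break

    def send(self, msg):
        reply = queue.Queue()
        self.mailbox.put((msg, reply))
        return reply

p = CounterProcess()
p.send("incr")
p.send("incr")
print(p.send("get").get())  # 2
p.send("stop")
```

Erlang makes this the default unit of the language, with far cheaper processes and pattern matching in place of the `if`/`elif` chain, but the mental model is the same.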

Aside from the above I can’t really comment on Erlang being the language of choice for concurrent systems. For me it is, therefore I’ll just spread the word: take a look. A clean, simple language with a supporting framework for building robust systems. Joe’s latest book, which I purchased immediately, will be a great contribution to the language and no doubt will introduce many more people to it too.

Update: It seems Ralph Johnson thinks Erlang is object oriented too. In a recent post, “Erlang, the next Java”, he expresses a similar view: that Erlang is an object-oriented language, and that the Erlang community (Joe Armstrong in particular) therefore needs to downplay the importance of the functional language aspects and instead focus on the message passing between processes (i.e. processes are just objects that receive and send messages).

1 comment

Scaling Yahoo! Pipes

If you’ve been trying to access Yahoo! Pipes and come up with a request for Mario, then have some sympathy for Yahoo!, as Dare Obasanjo comments.

As someone who works on large scale online services for a living, Yahoo! Pipes seems like a scary proposition.

Basically the service is likely to have hit the twin sisters of scaling large systems: CPU and I/O bounds. The nature and flexibility of what Yahoo! Pipes provides, user-defined queries over changing data streams that are activated every time a pipe is “pulled”, will create a heavy load. Dare continues.

It combines providing a service that is known for causing scale issues due to heavy I/O requirements (i.e. serving RSS feeds) with one that is known for scaling issues due to heavy CPU and I/O requirements (i.e. user-defined queries over rapidly changing data). I suspect that this combination of features makes Yahoo! Pipes resistant to popular caching techniques especially if the screenshot below is any indication of the amount of flexibility [and thus processing power required] that is given to users in creating queries.

This has always been an issue if you consider a centralised event-routing infrastructure. I suspect it is made worse by the “pull” nature of feeds, even if If-Modified-Since headers or ETags are used in the HTTP request. To determine whether a user-defined feed has changed would require processing the whole chain, effectively adding a lot of processing overhead. This is, however, a naive way of approaching the problem. An internal event architecture and caching are applicable; the caching is just different from the way content caching currently works. However the answer, I suspect, is not to centralise. The general infrastructure represented by Yahoo! Pipes should be deployed by millions. I have my own router for IP traffic, so why not have my own router for application data traffic?
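The ETag problem above is worth making concrete. A minimal sketch of conditional-GET handling (the function names are illustrative, not Yahoo!’s): the server can answer 304 Not Modified when the client’s cached ETag still matches, but note that computing the ETag here still required rendering the whole feed body, which is exactly the overhead a user-defined pipe cannot avoid without an internal event architecture telling it what changed.

```python
import hashlib

def etag_for(body: bytes) -> str:
    # Validator derived from the rendered feed body.
    return hashlib.sha1(body).hexdigest()

def respond(body: bytes, if_none_match):
    """Return (status, etag, payload). Saves bandwidth on a match,
    but not the work of producing `body` in the first place."""
    tag = etag_for(body)
    if if_none_match == tag:
        return (304, tag, b"")
    return (200, tag, body)

status, tag, _ = respond(b"<rss>...</rss>", None)   # first fetch
status2, _, _ = respond(b"<rss>...</rss>", tag)     # revalidation
print(status, status2)  # 200 304
```

Conditional GETs save bandwidth, not computation; that is why caching for pipes has to look different from content caching.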

No comments