Towards Ultra-High Resolution Models of Climate and Weather
Michael Wehner
CRD/NERSC, Lawrence Berkeley National Laboratory, Berkeley, CA 94720
Leonid Oliker
CRD/NERSC, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, loliker@lbl.gov
John Shalf
CRD/NERSC, Lawrence Berkeley National Laboratory, Berkeley, CA 94720
Abstract:
We present a speculative extrapolation of the performance aspects of an atmospheric general circulation model to ultra-high resolution and describe alternative technological paths to realize integration of such a model in the relatively near future. Due to a superlinear scaling of the computational burden dictated by the stability criterion, the solution of the equations of motion dominates the calculation at ultra-high resolutions. From this extrapolation, it is estimated that a credible kilometer scale atmospheric model would require at least a sustained ten petaflop computer to provide scientifically useful climate simulations. Our design study portends an alternate strategy for practical power-efficient implementations of petaflop scale systems. Embedded processor technology could be exploited to tailor a custom machine designed to ultra-high resolution climate model specifications at relatively affordable cost and power. The major conceptual changes required by a kilometer scale climate model are certain to be difficult to implement. Although the hardware, software, and algorithms are all equally critical in conducting ultra-high resolution climate studies, it is likely that the necessary petaflop computing technology will be available in advance of a credible kilometer scale climate model.
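A quick note on the "superlinear scaling" the abstract mentions: with an explicit dynamical core, the stability (CFL) criterion forces the timestep to shrink along with the horizontal grid spacing, so refining the grid by a factor r multiplies the dynamics work by roughly r cubed (r squared more columns times r more timesteps), while the column physics only grows like r squared. A purely illustrative Python sketch of that argument, with refinement factors I picked rather than anything taken from the paper:

# Illustrative CFL-style scaling: refining the horizontal grid by a factor r
# gives r^2 more columns and (via the stability limit on the timestep)
# r more timesteps, so dynamics work grows like r^3 while column physics
# grows like r^2 -- hence the dynamics dominates at ultra-high resolution.
def dynamics_cost_factor(r):
    return r ** 3

def physics_cost_factor(r):
    return r ** 2

for r in (2, 10, 100):  # purely illustrative refinement factors
    print(r, dynamics_cost_factor(r), physics_cost_factor(r))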
The full paper is here. The pop-sci version is here. The press-release version is here.
That's the same Mike I collaborated with for the SC05 Tri Challenge, where we ran a copy of CAM on bassi, reduced the data on PDSF, and visualized it on a Seattle-based system very similar to davinci, using a wide-area instance of GPFS as the only method of communication between them. It worked okay, but only because we hit a bug. Alas, that same bug would come back and nail us at work a few weeks later. We did win a prize and some very nice recognition for the effort, though.
That said, because of my mischief I have been attending conferences for embedded and real-time coding that have de nada to do with work. Or so I had planned. After all, I work in HPC with an emphasis on file systems. My research interests there are wide-area file systems, on-demand system aggregation, and the odd bit of paleoclimate work (emphasis Permian). The Mischief has de nada to do with that, as I said. Well, it turns out that NERSC, and LBNL in general, is making a push to raid the next generation of commodity computing parts, and I had walked into the middle of it all. Yep, you guessed it: the embedded chip guys, the folks behind the chips that run in your cell phone, your printer, or even your car.
You see, what's killing PCs and everything above them is the economics of it all: single-CPU clock speeds have gone pretty flat. They're not getting faster. The few single-core CPUs still making progress on raw computation speed are being killed by their power and heat requirements. Ever notice that a 386 didn't need the monster fan sitting on your current x86? Ever think about that power supply? We have hit the point where power consumption (and the heat dissipation that comes with it) grows faster than computational speed. The answer was to start adding cores to the chip.
Starting a few years back, the chip companies began putting more and more cores on a single "CPU." Really these are multiple CPUs on the same silicon real estate more than anything else, though they are tied together more closely than the older SMPs were. The cores also tend to be simpler than the single CPUs that came before, partly to save power. In some ways this strikes me as very amusing; it's like the revenge of RISC. Even so, the cost of developing those chips for the CPU monsters is still obscene.
Think rocket cost and then some. I can't share a lot of what I know, but I will just say that it's shocking. I'll also say I have to tread carefully here due to NDAisms. (oy)
So why is the HPC world - or at least LBL - interested in embedded products? One: cost. Developing an embedded chip is cheaper. By a lot. Two: power consumption. Your desktop might consume a good 500 watts. As an oversimplification, Franklin has about 9,000 sockets with 2 cores per socket. If you were to merely scale up the power requirements, you're talking 4.5 megawatts for a 100 teraflop system. Not good. For a petaflop system, multiply by at least 10x. Now consider the next 'frontier': the exaflop system. Yes, 1000x the speed of systems not yet built (but coming in the next year or two). Yep. That power requirement alone kills an exaflop system based on current or extrapolated PC/server processor tech.
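To make that back-of-the-envelope math concrete, here is a minimal Python sketch. The 500 watts per socket and the 10x/1000x scaling factors are just the rough figures quoted above, not measured numbers:

# Naive power extrapolation, assuming every socket draws desktop-class
# power (~500 W) and using Franklin's ~9,000 sockets as the baseline.
WATTS_PER_SOCKET = 500.0
FRANKLIN_SOCKETS = 9_000

def power_megawatts(sockets, watts_per_socket=WATTS_PER_SOCKET):
    # Total power in megawatts for a machine of the given socket count.
    return sockets * watts_per_socket / 1e6

mw_100tf = power_megawatts(FRANKLIN_SOCKETS)
print(f"100 teraflop class: {mw_100tf:.1f} MW")                          # ~4.5 MW
print(f"petaflop class (10x): {mw_100tf * 10:.0f} MW")                   # ~45 MW
print(f"exaflop class (another 1000x): {mw_100tf * 10 * 1000:.0f} MW")   # ~45,000 MW

Scaled naively, the exaflop machine lands in the tens of gigawatts, which is the whole point: you cannot get there with this class of processor.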
The embedded guys get very concerned if their processors start using more than single-digit watts. Our guys were intrigued by that and started playing with the concept. I wasn't involved at all until quite recently, and even now I am only involved to a degree: I happen to have contacts that the Lab would like to use. They already have; we pulled in an Intel embedded chip architect I connected with at a conference one of my mischief sponsors invited me to. What I have learned about their plans for the next-gen supercomputer is rather interesting.
John, Mike, and Lenny's design is to develop an HPC platform that is cheaper, tailored to a specific application but able to run others, and fast. Really fast. The goal, according to the presentations I sat in on, was an exaflop. The whole idea is to be able to simulate a 1 km grid for the CCSM climate simulations, which would allow accurate cloud simulation, something they can only approximate now. The power consumption and the cost of the machine would be far, far lower than today's. The design they talked about was...interesting.
One of the key ideas is the transition from single core to multicore and then, in the center's opinion, to manycore. They are expecting around 1,024 cores per chip, but much, much smaller and simpler cores. Put together 64 of those chips and you have a machine with as many cores as the old 64k-processor CM-2 I used back in the early 90s. With 256 of these chips you would have more cores than Blue Gene at LLNL (note: it wouldn't be as fast, but it would be far cheaper and far less power hungry). John, Mike, and Lenny have proposed that the machine they want would have 170,000 sockets, each with a chip not too dissimilar to what I outlined above living in it. o.O
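If you want to check my arithmetic on those core counts, here is a quick sketch. The 1,024 cores per chip and one chip per socket are the assumed round figures from the talks:

# Core-count arithmetic for the proposed manycore design, assuming
# 1,024 simple cores per chip and one chip per socket.
CORES_PER_CHIP = 1024

print(64 * CORES_PER_CHIP)       # 65,536  -- on par with the old 64k-processor CM-2
print(256 * CORES_PER_CHIP)      # 262,144 -- more cores than Blue Gene at LLNL
print(170_000 * CORES_PER_CHIP)  # 174,080,000 -- the full 170,000-socket proposal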
That is not to say I don't have my worries. I am not the applications guy; I only dabble there, though I do have an interest. Programming models that scale past 1k cores are not very common now. What would need 170k sockets, let alone the cores therein? In fact, there is a nontrivial number of 'supercomputer users' who really don't belong on our machines anymore; they just need a small-to-medium cluster in their department rather than the big iron we have. Now try to imagine how few users are actually able to use our monster here. Not very many. In this case, the machine was designed around a single code, and the others can run on it as best they can.
However, I also have to think about the time I spend on rotation here. If we were to acquire one of those ubersocketed monsters, there would be a constant rain of dead or dying processors, even with the lightest-weight nodes. Even at 99.999% reliability you are still going to have nearly two nodes down at any given time. That's not so good to have to deal with. Now, about all that memory too...
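Where does "nearly two" come from? A minimal sketch, assuming each of the 170,000 single-socket nodes is independently up 99.999% of the time (both figures are just the round numbers used above):

# Expected number of nodes down at any instant, assuming independent
# failures and 99.999% availability per node across 170,000 nodes.
NODES = 170_000
AVAILABILITY = 0.99999

expected_down = NODES * (1 - AVAILABILITY)
print(expected_down)  # ~1.7 -- nearly two nodes down at any given moment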
There's much more to go on about, but I don't have time at the moment; I have yet another dreaded meeting. Bah. Just ponder what they are talking about when they talk about supercomputers of the future and that infernal Singularity nonsense.
Yes. I understood very little of that post, being as skilled with computer jargon as an orang with a spear (hey, wait a second...), but I understood the basic premise.
Now, I read a while back a Scientific American article detailing how scientists are trying to use PHOTONS to carry information. Is this all just happy happy joy joy talk, or is something like that actually possible?
uh Zach.
How did you think your message got to me?