Re: ARTICLE: Memory bandwidth

From: James Rogers (jamesr@best.com)
Date: Fri Apr 13 2001 - 22:06:12 MDT


On 4/13/01 6:15 PM, "Brian Atkins" <brian@posthuman.com> wrote:

> I didn't even read the article yet, but I have read about the P4 quite a
> bit. I don't think it is really appropriate to draw conclusions based on
> comparing it to P3 since they are radically different, and IMO the P4 is
> a botched design. They were trying for something really amazing with it,
> but a lot got gutted out of the design resulting in a very unbalanced
> CPU. In many benchmarks for instance a slower clocked P3 would beat a P4.

You are somewhat missing the point. The comparisons are in fact typically
between an 800-MHz P3 and a 1.5-GHz P4, but the difference in clock speed is
immaterial if you actually look at the benchmarks (mostly because the
benchmarks show *why* clock speed is immaterial). In fact, the article
squares perfectly with what I've been hearing on the hardware lists where
people have been running their own benchmarks.

In short:

For fp and vector codes *where the data fits in the cache*, you generally
get much better performance with the P4 than the P3, i.e. better than
suggested by the clock speed difference. Since the data fits in the cache,
memory bandwidth and latency are mostly irrelevant.

For fp and vector codes that have data sets substantially larger than the
cache, the speed of the processor is irrelevant: a P3 will be memory limited
and the P4 will be memory limited. The P4 has a poor memory architecture
and benchmarks as a dog compared to similarly clocked Athlon chips, due
solely to differences in memory architecture; the Athlon has more bandwidth,
so for memory limited problems it is faster, even at slower clock speeds.
Note that the P4 has 15-20% *worse* memory latency than the P3, so cache
efficiency is even more important on the P4 than on the P3. What the P4
represents is one of the first times where the inter-generational
differences in processor performance are based almost solely on the memory
bus chipsets, not on the processor clock.
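
To make the two cases above concrete, here is a rough roofline-style sketch.
It assumes a kernel takes as long as the slower of its compute time and its
memory-traffic time. The function name and every number in it are made up
for illustration; none of them are measurements of a P3, P4, or Athlon.

```python
def kernel_time(flops, bytes_from_memory, peak_flops_per_s, bandwidth_bytes_per_s):
    """Estimate run time as the slower of compute and main-memory traffic."""
    compute_s = flops / peak_flops_per_s
    memory_s = bytes_from_memory / bandwidth_bytes_per_s
    return max(compute_s, memory_s)


# Hypothetical workload: one flop per 8-byte double.
N = 100_000_000
FLOPS = N
BYTES = 8 * N

# Hypothetical machine figures (assumptions, not benchmark results).
SLOW_CPU = 1.0e9    # flop/s
FAST_CPU = 2.0e9    # flop/s
BANDWIDTH = 1.0e9   # bytes/s from main memory

# Data set much larger than the cache: every double comes from main memory.
t_slow = kernel_time(FLOPS, BYTES, SLOW_CPU, BANDWIDTH)
t_fast = kernel_time(FLOPS, BYTES, FAST_CPU, BANDWIDTH)
t_more_bw = kernel_time(FLOPS, BYTES, SLOW_CPU, 2 * BANDWIDTH)

# Data set that fits in the cache: treat main-memory traffic as zero.
t_cached_slow = kernel_time(FLOPS, 0, SLOW_CPU, BANDWIDTH)
t_cached_fast = kernel_time(FLOPS, 0, FAST_CPU, BANDWIDTH)

print("memory-limited, slow cpu:", t_slow)
print("memory-limited, fast cpu:", t_fast)
print("memory-limited, 2x bandwidth:", t_more_bw)
print("cache-resident, slow cpu:", t_cached_slow)
print("cache-resident, fast cpu:", t_cached_fast)
```

Under these assumptions the faster processor buys nothing in the
memory-limited case, doubling the bandwidth halves the run time, and the
processor speed only shows up once the data is cache-resident.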

As for having a botched design, you are mostly correct -- the memory
architecture is much worse than it has to be (addressed in the article).
The problem is that Intel doesn't have a huge amount of room to improve it.
For those classes of problem where everything (or most everything) fits in
the high-speed cache, it screams. Unfortunately, we aren't talking about
those types of problems.

The point of all this is that if you are running an AI, which presumably
will be churning on vast quantities of data, the clock speed doesn't matter.
What the benchmarks between the P3 and P4 show is that memory bandwidth is
already a serious crisis for data-hungry apps. The situation on the P4 is
so bad that they expect faster versions of the P4 to show no performance
improvement for cache killers, whereas it used to be that you got *some*
minor improvement with increased clock speeds even for these apps.
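
A back-of-the-envelope way to see why clock speed stops mattering: assume
run time is the CPU work (which scales with the clock) plus a fixed amount
of time spent waiting on memory (which does not). This is a simplified
model of my own for illustration, and the cycle counts and stall times
below are invented, not taken from the article or any benchmark.

```python
def run_time(cpu_cycles, clock_hz, memory_stall_s):
    """CPU work scales with the clock; time waiting on memory does not."""
    return cpu_cycles / clock_hz + memory_stall_s


def speedup(cpu_cycles, memory_stall_s, old_hz, new_hz):
    """Ratio of old run time to new run time after a clock bump."""
    old = run_time(cpu_cycles, old_hz, memory_stall_s)
    new = run_time(cpu_cycles, new_hz, memory_stall_s)
    return old / new


OLD_HZ = 1.5e9   # hypothetical current clock
NEW_HZ = 2.0e9   # hypothetical faster part
CYCLES = 1.5e8   # hypothetical CPU work, in cycles

# "Cache killer": almost all of the time is spent waiting on memory.
cache_killer = speedup(CYCLES, memory_stall_s=0.9, old_hz=OLD_HZ, new_hz=NEW_HZ)

# Cache-resident: no time spent waiting on main memory.
cache_resident = speedup(CYCLES, memory_stall_s=0.0, old_hz=OLD_HZ, new_hz=NEW_HZ)

print("cache killer speedup:   %.3f" % cache_killer)
print("cache-resident speedup: %.3f" % cache_resident)
```

With these made-up figures a 33% clock increase gives the cache-resident
code the full 1.33x, while the cache killer gets under 3%. As the memory
stall term grows relative to the CPU term, that small gain shrinks toward
nothing.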

-James Rogers
 jamesr@best.com


