From: J. Andrew Rogers (email@example.com)
Date: Thu Jul 08 2004 - 13:49:11 MDT
> * code for today's/next year's commodity iron, but think
> about portability to massively parallel async molecular
> electronics systems with mole amounts of switches.
Good advice, but not always so easy to do in practice.
Maximum scalability on current hardware may require making
design choices that are not terribly compatible with a
design choice you would make in ten years. There are many
codes that I would want to do with fine-grain message
passing in the future that I wouldn't want to do that way
today because they won't scale well.
> * your code should scale O(0) whether for 10^0 or 10^9
> nodes (yes, it can be done with the right topology and
> communication pattern) -- if it doesn't you're doing
> something really wrong. Screw Amdahl, no sequential
> sections allowed.
Not bloody likely. Even with fine-grained fully async MP
architectures and the code to go with it, you'll still have
to sync up regionally if you want to get a meaningful answer
out of it. Even the human brain exhibits no ability to
partition its resources beyond a coarse level.
> * avoid writing to disk frequently, try to do without
> swap. There's a special case if you have a large library
> of patterns, and can stream sequentially, or tolerate 10
> ms fetches sometimes.
For that matter, avoid reading from disk frequently. Disk
is just a grossly inefficient form of RAM, and if normal
core latency is a bottleneck (and it is for many codes),
having core with 10ms latency is almost useless.
This also means that any kind of RAM connected to switched
fiber within a 100km radius is vastly better than local
disk. If there were an easy way to mount the RAM of an
arbitrary number of network connected boxes as local 64-bit
swap space, that would be interesting. (It sounds like
something that should be easy to do on Linux or similar, but
it isn't in practice.)
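For the curious, the closest approximation I know of on Linux is the network block device driver; the sketch below is illustrative only (the host name, port, and device node are placeholders), and it assumes an nbd-server on the remote box exporting something RAM-backed like a tmpfs file:

```shell
# Hedged sketch: attach a remote RAM-backed export as local swap.
# "memhost", port 10809, and /dev/nbd0 are illustrative.
sudo modprobe nbd
sudo nbd-client memhost 10809 /dev/nbd0   # attach the remote export
sudo mkswap /dev/nbd0                     # format it as swap
sudo swapon -p 10 /dev/nbd0               # prefer it over local disk swap
```

Getting this to behave under memory pressure (the network stack itself needs memory to page out over the network) is exactly the part that isn't easy in practice.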
Useful even for codes that don't need a ton of CPU.
> * use arrays of small data types, avoid absolute pointers
> (use relative addressing as offsets to current position)
If you don't make a habit of using the system to allocate
gobs of little objects, a good compiler will do the rest.
The structure of the arrays matters too. How many cache
lines get blown out can vary quite a bit depending on how
you lay out those data types. Two structure arrays of
identical size and type but with different internal layout
can cause integer-factor differences in performance.
> * only access memory in stream mode, try to limit
> unpredictable accesses to 1st/2nd level cache (here it
> actually pays to use assembly for prefetch, otherwise
> don't bother with assembly but maybe for inner loop when
> you're done. This will cut your hardware budget by some
> 300%, ditto operation costs.
The biggest performance improvements I've gotten were from
memory access optimization, i.e. maximizing the number of
cache hits at all levels. You can do this in C and largely
eliminate any nominal benefit one might see from assembly.
In fact, I've discovered that big ugly "slow" codes that
have been optimized for extreme cache efficiency (which
almost invariably is what makes them "big and ugly") will
often run integer factors faster than the tight and elegant
"ideal" implementations. Most of the pipelines in a modern
CPU sit idle anyway; making them even more idle by
optimizing for instruction efficiency often won't net much.
Optimizing for cache efficiency gives all those pipelines
something to do, instruction optimized or not.
> * don't use floats, stick to integers
> * if you use integers, try using short integers (4-8 bit
> is great)
Also good advice. Integers are silicon friendly, and using
small integers will help sneak a lot of things inside a
cache line. As for byte-slicing, I would only do that where
there is an obvious benefit e.g. for cache line boundaries.
The extra instructions are basically free, but if it isn't
needed then keep things simple. A wash in many cases.
As for floats, how much dynamic range do you really need?
> * stick to integers and codes you can crunch in parallel
> with MMX/SSE* type of SWAR SIMD
You only have so much freedom in algorithm selection. Some
codes won't be SIMD friendly in any significant sense.
> * make sure your inner loop would translate into logic
> some 10^3 gates simple for each individual small integer
With a fully optimized compile, my core code can currently
fit entirely inside the L1 instruction cache of most modern
CPUs. Which is close enough to FPGA for me. Hard to say if
that will last though. Probably won't.
> * put the complexity into data pattern, not code pattern.
> Code is not that important, state is.
Or to rephrase, there is no difference between state and
code. Making the code as tight as possible will keep the
other cache free for more important things.
> * if you think you can't do it in C, or don't need to,
> you're doing it wrong
Most likely, though I know there will be howls of protest.
The only reason this isn't true is if you can afford to blow
an order of magnitude of both speed and memory consumption
for a friendlier language. Personally, I find any
differences in performance and efficiency that can be
described in terms of "orders of magnitude" to be
qualitative and therefore important.
And no, JIT-compiled Java etc. does not even come close for
the types of codes Eugen is talking about. Languages like Java
have pathological cache behaviors that cannot be optimized
away. This is precisely the space where C becomes glorious
in its capabilities compared to most other languages.
> * if can't describe the algorithm and the data flow on a
> large piece of paper, you're doing it wrong
j. andrew rogers
This archive was generated by hypermail 2.1.5 : Wed Jul 17 2013 - 04:00:48 MDT