**From:** James Rogers (*jamesr@best.com*)

**Date:** Tue Sep 16 2003 - 02:54:05 MDT


There are so many different concepts interwoven into a single elegant model that I get easily distracted when explaining certain aspects of my work. My use of the word "convergent" in a previous email was perhaps unfortunate; its importance was only implied, a layer or two of indirection away, and then only for someone with the appropriate context.

I'm going to start exploring some of the interesting points mentioned in the title here, but I'm going to work from a more basic, data-oriented perspective rather than from the assumption of a particular knowledge representation framework that has almost no context for most of the people reading this.

Let us assume that we have a trivial function that generates the following infinite stream of characters:

F = "1947194719471947..."

Writing a program that can express such a function is a minor feat, one line in Python and many other languages. By definition, this function has a low Kolmogorov complexity. More importantly, the entropy of the function has a very specific profile, being locally high and globally low. Locally, small fragments have high entropy taken by themselves (e.g. "4719") and are not meaningfully regular. Taken over long fragments, higher-order regularities are apparent even upon casual observation (e.g. the pattern "1947" appears to repeat ad infinitum).
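For concreteness, here is a minimal sketch of such a generator in Python. This is my own illustration of the kind of one-liner alluded to above, not the actual code:

```python
from itertools import cycle, islice

# A trivial generator for the infinite stream "194719471947..."
# (illustrative sketch only; the original one-liner is not shown in this post)
def F():
    yield from cycle("1947")

# e.g. take the first 16 characters of the stream
print("".join(islice(F(), 16)))   # -> 1947194719471947
```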

How would one represent this function from the standpoint of efficient knowledge representation? Assume that pattern computation is a non-issue (relevant to AGI, but not to this email or this narrow discussion), and that a reference to a character takes as much memory as the character itself in practice.

1.) A naïve implementation would simply store the sequence of characters as they are generated, since there is no guarantee that they will repeat forever; the repetition is only inferred, as we have no special knowledge of the function.

2.) A more clever implementation would use a sequential statistical model (assume it is a 4th-order model, for the sake of the example), which would code a long sequence of references to the "1947" node of the model. This represents a substantial savings, storing a single reference instead of four characters and giving a roughly 4:1 compression ratio after discounting the fixed overhead of the model. Let's call the model overhead function OH(), which is a function of the number of patterns stored in it (complex and exponential, but minor in some cases -- very important later).

3.) An even more clever implementation doesn't use a single statistical model. Rather than modeling the data stream alone, it also models the references to the models, and encodes references to *them*, and so forth (a toy sketch of this idea follows the list). This works even better, giving memory usage that is logarithmic in the total characters stored, not counting model overhead.
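To make the difference concrete, here is a toy sketch of the #3 idea in Python. This is my own construction for illustration, not the implementation described in this thread: each level replaces adjacent pairs of references with references to new composite nodes, so a block repeated k times collapses in roughly log2(k) levels, whereas a flat #2-style model keeps one reference per block.

```python
def pair_level(refs, table):
    """One modeling level: replace adjacent pairs of references with
    references to composite pair-nodes (creating nodes as needed)."""
    out, i = [], 0
    while i + 1 < len(refs):
        pair = (refs[i], refs[i + 1])
        if pair not in table:
            table[pair] = len(table)   # model overhead: one node per distinct pair
        out.append(table[pair])
        i += 2
    if i < len(refs):                  # odd leftover reference passes through
        out.append(refs[i])
    return out

def hierarchical_encode(stream):
    """Repeatedly model the references to the previous level's references."""
    table = {}                         # the "model": distinct composite patterns seen
    refs = list(stream)                # level 0: raw characters
    levels = 0
    while len(refs) > 1:
        refs = pair_level(refs, table)
        levels += 1
    return levels, len(table)

# "1947" repeated 1,024 times: 4,096 characters collapse into 12 levels and a
# 13-node model, versus ~1,024 references for a flat #2-style model and 4,096
# stored characters for #1.
print(hierarchical_encode("1947" * 1024))   # -> (12, 13)
```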

Each of the implementations is more clever with representation than the previous, but you say, "So what? It is a simple repeating pattern! Representation doesn't matter!"

Let's change the assumptions a bit. Let's assume that the pattern only repeats for 1,000 iterations before changing to another random sequence of four characters (say, "6872"), and keeps changing every 1,000 iterations. Let's call the number of different patterns "N". Implementation #1 faithfully records the characters "as is" like it always has, implementation #2 makes adjustments to its statistical model with a modest increase in overhead for a space complexity of DATA/4 + OH(N), and implementation #3 faces a simple increase of N*log(DATA) + OH(log N)*N.
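Rough numbers only: plugging the space expressions above into Python with a deliberately simple, hypothetical OH() stand-in (the real overhead function is far messier, as noted later in this email):

```python
from math import log2

def OH(n):
    # Hypothetical overhead: a few references of bookkeeping per stored pattern.
    # Stand-in only; the post treats OH() as complex and potentially exponential.
    return 8 * n

N = 1000                  # number of distinct 4-character patterns
DATA = N * 1000 * 4       # total characters emitted: 4,000,000

impl1 = DATA                                   # store every character verbatim
impl2 = DATA / 4 + OH(N)                       # one reference per 4-character block
impl3 = N * log2(DATA) + OH(log2(N)) * N       # references to references

print(f"#1 {impl1:>12,.0f}")   # -> 4,000,000
print(f"#2 {impl2:>12,.0f}")   # -> 1,008,000
print(f"#3 {impl3:>12,.0f}")   # -> roughly 100,000
```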

Let's make the problem even more interesting. Let's assume that the function rolls over to the beginning after 1,000 different patterns are generated 1,000 times each. If we were talking about a UTM with infinite memory, this would not be a problem, but since we live in this universe we have to deal with the fact that everything is an FSM, with some fixed amount of character/reference storage "M". It is conceivable that with implementation #1 we never reach the rollover point where the function repeats the original pattern, because the machine runs out of memory after M characters. Implementation #2 does better, at M/4 - OH(N). Implementation #3 does even better at N*log(M) + OH(log N)*N, with a vastly smaller memory footprint than either #1 or #2 to losslessly encode the same pattern.

Again you ask, so what? One implementation is substantially more efficient than the others, but that is a matter of degree.

Something very interesting happens when the function rolls over and repeats the same pattern again. Assume the machine has enough memory that #2 has space to record two full iterations of the "super-pattern", where the digit patterns go back to the original "1947" and repeat the sequence of apparently random pattern changes, before running out of memory. It follows that #1 will never see the rollover super-pattern and will view the sequence of pattern changes as truly random, while #3 will easily have enough memory resources to record several iterations of the rollover super-pattern.

For the second iteration of the super-pattern, you see a new memory consumption function: both #2 and #3 start to average less memory used per character stored the more times the super-pattern repeats. #2 asymptotically approaches DATA/4 for a minor improvement (taking into account that "DATA" is twice as large after two iterations of the super-pattern), with the function OH(N) effectively becoming a constant. However, #3 consumes *almost no additional memory*, just some negligible additional model overhead, and it already has more efficient OH() scalability than #2 (the details of why there is an effective difference in model efficiency are out of scope here).
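Running the toy sketch from earlier on a scaled-down version of this scenario shows the same effect: once the super-pattern repeats, the second pass costs the #3-style model almost nothing in new overhead. (My illustration again, with hypothetical block values and counts chosen to keep the run small, not the numbers used above.)

```python
# Reuses hierarchical_encode from the earlier sketch.
# Toy super-pattern: 16 hypothetical 4-digit blocks, each repeated 16 times.
blocks = ["%04d" % (617 * k % 10000) for k in range(16)]
super_pattern = "".join(b * 16 for b in blocks)        # 1,024 characters

once  = hierarchical_encode(super_pattern)             # first pass builds the model
twice = hierarchical_encode(super_pattern * 2)         # second pass reuses it

print(once, twice)   # the model grows by a single node when the super-pattern repeats
```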

#2 and #3 represent implementations that are capable of "converging", which is to say the memory per character stored can start decreasing as more of an apparently FSM-generated pattern is encoded. Obviously, #2 and #3 represent two very different implementational efficiencies, as #3 requires vastly less memory to losslessly encode a pattern, something that becomes particularly evident as the corpus of data to encode grows. What is also evident is that the memory profiles are different enough that #3 could converge on patterns sufficiently complex that #2 would never have enough memory to converge (and even if it did, it would be comparatively wasteful of resources).

One thing that I didn't really cover is the pathological case of truly random data. In this case, #3 will actually give the WORST performance of all three. #1, #2, and #3 are actually different algorithm classes (evidenced to a certain extent by their space complexity profiles). #1 guarantees its space requirements but never converges; #3's resource requirements fluctuate quite a bit within certain hard limits depending on the apparent complexity of the data source, but it will start to converge at the slightest hint of a pattern (obvious or subtle and abstract); and #2 is somewhere in the middle. The example function was painfully simple, and the scalability math is a bit uglier to compute with more real-world datasets.
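The pathological case is easy to see with the same toy sketch: feed it data with no structure and the model overhead piles up with nothing to pay for it. (My illustration, reusing hierarchical_encode from the sketch above on hypothetical random input.)

```python
import random

random.seed(1)
noise = "".join(random.choice("0123456789") for _ in range(4096))

print(hierarchical_encode("1947" * 1024))   # (12, 13): tiny model, huge savings
print(hierarchical_encode(noise))           # 12 levels again, but a model with
                                            # roughly two thousand nodes: all
                                            # overhead, no compression to show for it
```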

Which brings me to an important point. The efficiency of an implementation is directly related to the Kolmogorov complexity of a pattern that the implementation can losslessly store in a finite amount of memory. As in the above example, the differences in practical efficiency can be extraordinary. More efficient implementations are capable of recognizing and utilizing more abstract patterns. There are actually multiple ways to do implementations with the rough efficiency class of #3 (I've probably come up with three or four myself), so it is really a class of algorithms, but that is beyond the scope of this email. To ground this a bit: #1 is a brain-dead UTM model, #2 is represented in the CS literature, and #3 is only described in AIT mathematics (and my code base).

Now to bring up another important observation that I'm guessing most people overlooked, probably in part because I didn't go into the implementational details of #3, though it should be somewhat obvious upon reflection:

The entire state space of the #3 data structure will be almost devoid of duplicate patterns. You can show why this should/must be the case from a number of theoretical angles (left as an exercise for the reader -- an easy one).

The implication is important, though. It suggests that one has a structure that approximates an exhaustive index of every pattern it has seen, as an intrinsic side effect of its operation. While there is some truth to this, the problem is much uglier in practice and a much more complex theoretical problem.
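In the toy sketch above, the no-duplicates property holds by construction: a composite node is created only when its (left, right) key is absent from the table, so every distinct composite pattern is stored exactly once, and the same table can be read back as a crude index of every pattern seen. A tiny self-contained demonstration (mine, not the structure discussed in this thread):

```python
# Same pairing idea as the earlier sketch, on a short input.
table = {}
refs = list("19471947")
while len(refs) > 1:
    out = []
    for i in range(0, len(refs) - 1, 2):
        pair = (refs[i], refs[i + 1])
        table.setdefault(pair, len(table))   # a node is never stored twice
        out.append(table[pair])
    refs = out

by_index = {i: pair for pair, i in table.items()}

def expand(node):
    """Recover the substring a composite node stands for (illustrative helper)."""
    if isinstance(node, str):
        return node
    left, right = by_index[node]
    return expand(left) + expand(right)

print(sorted(expand(i) for i in table.values()))
# -> ['19', '1947', '19471947', '47']: every composite pattern seen, each stored once
```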

In any case, it is very late and my brain is numb. I may have screwed up the OH() function for #3, but I can't think well enough right now to ascertain whether or not it is correct. It looks like a footnote in this email, but it is quite important and a small book in itself. I'll see about correcting any glaring errors tomorrow.

The reference in a previous post was to a #3-class structure that converges very close to optimal and actually expresses a truly exhaustive pattern index. Much harder than it sounds, because the OH() function is a serious bitch to persuade to play nice.

Cheers and Good Night,

-James Rogers

jamesr@best.com
