**From:** James Rogers (*jamesr@best.com*)

**Date:** Sun Dec 03 2000 - 20:09:16 MST

**In reply to:** Ben Goertzel: "RE: Is generalisation a limit to intelligence?"

*>> Generally speaking, given a finite amount of memory and an arbitrarily long sequence of data (generated by any finite state machine no matter how complex), it is possible to attain the minimum possible predictive error rate using universal prediction schemes.*

*> However, this kind of "in principle" calculation is not very useful in practice, now is it? only in very narrow domains, like text compression...*

Yes and no, depending on your assumptions. I actually agree with you to the extent that you don't need to know exactly what the minimum predictive error rate is, as long as your algorithms approach it reasonably quickly. Most of the limits are computational -- doable, but very expensive for unrestricted domains.

However, I wouldn't (and don't) try to apply these calculations to a single complex domain (no need to rehash the same old problems of AI). Rather, I have algorithms that automatically and adaptively partition domains (in a manner that preserves all hierarchical, associative, etc. context) such that good prediction remains computationally tractable for any given domain space, without significantly impacting prediction error rates for the complex virtual domain (query) that the system is currently operating on. In fact, the decision of a node to partition itself is driven largely by recognizing when a domain has become computationally inefficient, in the sense that good results require excessive CPU churn (that recognition is itself a relatively complex algorithm). It is really just a matter of getting around the computational complexity of the core computation without significantly sacrificing results.

The better text compression algorithms, such as Lempel-Ziv and arithmetic coding, actually are based on universal prediction schemes (entropy and prediction being closely related), and are therefore applicable to any sequence of data, not just text, though that is where they are most commonly used. LZ is not actually an optimal universal prediction scheme, but it is an excellent universal predictor for systems with very limited amounts of memory (frequently the textbook example, in fact). Optimal universal predictors asymptotically approach theoretical error rates very quickly, but also tend to have exponential memory consumption, giving rapidly diminishing returns. Therefore, for many finite-memory applications, non-optimal predictors can actually produce better results in the given amount of resource space, even if they approach the theoretical minimum error rates much more slowly.
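As a rough illustration of the LZ point, the following is a minimal sketch of an LZ78-style incremental-parsing predictor. The class name `LZ78Predictor`, the helper `error_rate`, and the tie-breaking rule (fall back to the first alphabet symbol) are assumptions made for this sketch, and nothing here should be read as the specific predictor described in this post.

```python
class LZ78Predictor:
    """Toy LZ78-style (incremental parsing) predictor over a small alphabet."""

    def __init__(self, alphabet):
        self.alphabet = list(alphabet)
        # Each tree node is a dict: symbol -> [count, child_node].
        self.root = {}
        self.node = self.root

    def predict(self):
        # Guess the symbol seen most often from the current node;
        # with no statistics yet, max() falls back to the first alphabet symbol.
        return max(self.alphabet,
                   key=lambda s: self.node[s][0] if s in self.node else 0)

    def update(self, symbol):
        entry = self.node.get(symbol)
        if entry is None:
            # A new phrase ends here: grow the tree by one leaf, restart at the root.
            self.node[symbol] = [1, {}]
            self.node = self.root
        else:
            entry[0] += 1
            self.node = entry[1]


def error_rate(predictor, sequence):
    """Fraction of symbols guessed wrong, predicting before each update."""
    errors = 0
    for symbol in sequence:
        if predictor.predict() != symbol:
            errors += 1
        predictor.update(symbol)
    return errors / len(sequence)


print(error_rate(LZ78Predictor("01"), "01" * 1000))
```

On a periodic input like this one, mistakes can only occur where a parsed phrase starts or ends, so the error rate falls as phrases get longer, but slowly; memory grows by one tree node per phrase.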

My own experience indicates that it doesn't really matter whether you use an optimal universal predictor. However, there is a very broad spectrum in performance for universal predictors, such that if you choose a poor one, it is likely to be essentially worthless for real-world problem spaces. A well-designed system with poor prediction functions will appear to have poor intelligence during your lifetime, even though the prediction functions may converge enough to give good performance at some time far in the future.

*> Then, if the system has enough time to learn, EC (or more simply, Monte Carlo search over program space), will cause the system to arrive at a program that can achieve its goals optimally*

And I thought my methods were computationally expensive... :^)

*> Since none of these conditions obtain, we need something much more specialized and more complicated than EC inside our thinking machine...*

I'm not a big fan of EC as a means of generating a thinking machine. Aside from the arguable inelegance of the result (depending on your point of view), there necessarily must exist a large number of cleaner and more efficient methods, though they require a more specialized architecture.

*>> In short, it has been demonstrated that for any finite state machine, it is possible to ascertain the minimum possible predictive error rate for any data sequence given any finite amount of memory.*

*> Yes, but this method will not perform adequately under the conditions under which a real intelligent systems have to operate.*

I would disagree; it depends on how you use it.

*> The problem is that the prediction schemes that are "optimal" under standard mathematical assumptions, are NOT optimal given the real-world conditions under which organisms operate.*

Pretty much by definition, an "optimal" universal predictor is impervious to real-world conditions, whatever those may be, hence the term "universal". The algorithms are model independent. A truly optimal predictor would of course require infinite memory. However, it is possible to have universal predictors that are optimal for any given resource configuration. Even "good" universal predictors will do the job; they just take longer to converge.

*> Yes, it answers the question under the glaringly false assumption that minds embody general optimal predictive schemes*

Optimality is not required for good predictive performance; it merely sets the standard for how good the performance can get.

*> These experiments certainly do not demonstrate that humans are finite-state machines. There are many many other explanations for this data. I won't bore you by reciting them.*

I never claimed that this was a proof. The problem is that it can't be disproved that the human mind is a finite-state machine. Humans have a knack for getting mathematically classified as FS machines via their information-theoretic behavior. It doesn't matter to me one way or the other; it was merely an observation.

*> Most of what makes real minds interesting is NOT about optimal prediction or modeling, but about pretty good ways of achieving pretty good intelligence within very limited space and time resources. This is a whole different story from information theory.*

Obviously my original post came across as incomplete (the hazards of sending email at 2am), but I was talking about theoretical limits of prediction for any finite amount of resources, i.e. given a certain amount of memory and processor, how accurately can a program predict for any arbitrary model? That sounds a lot like predicting the limits of intelligence in any finite resource space.
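The memory-versus-accuracy trade-off can be sketched with order-k frequency (Markov) predictors, whose memory grows with the number of contexts they track. The function name `markov_error_rate`, the `default` guess, and the period-3 test sequence are assumptions chosen for illustration; this is a toy rather than a universal predictor.

```python
from collections import Counter, defaultdict


def markov_error_rate(sequence, order, default="0"):
    """Error rate of a predictor that guesses the most frequent successor
    of the last `order` symbols, learning online as it goes."""
    counts = defaultdict(Counter)  # context -> counts of the symbol that followed
    errors = 0
    for i, symbol in enumerate(sequence):
        context = sequence[max(0, i - order):i]
        seen = counts[context]
        guess = seen.most_common(1)[0][0] if seen else default
        if guess != symbol:
            errors += 1
        seen[symbol] += 1
    return errors / len(sequence)


data = "001" * 200  # output of a three-state machine
for k in (0, 1, 2):
    print(k, round(markov_error_rate(data, k), 3))
```

With too little context (orders 0 and 1) the predictor keeps missing about one symbol in three on this source, while order 2 has enough memory to capture the generating machine and makes essentially no errors once each context has been seen.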

*> For example, in the area of computational linguistics, Denis Yuret's excellent MIT PhD thesis from a few years back uses information theory to model language ("lexical attraction" he calls it). All well and good. It doesn't help you deal with the translation of language into meaning. I think I know how to do the latter, but not using information theory explicitly....*

I have a small pile of links and papers on this topic that, admittedly, I mostly haven't read. However, it was my understanding that these language models offer better context inference performance than any other language analysis methodology to date, and by a good margin. Also, I don't really construe "meaning" to be much different from "context", though you may.

I should probably start reading these...

*> I mean, how do you construct a mind given current computational resources and real-time learning constraints, inspired primarily by information theory?*

First of all, the architecture looks a lot like many other modern architectures, but uses information theory to fill in some holes and resolve some hard problems; information theory is not fundamental to the architecture, only fundamental to the way some components of the architecture behave.

Scalability is a serious issue given current-day hardware constraints, but I've personally minimized it as far as possible. The kernel layer, which effectively acts as an operating system for the mind, is a highly scalable database kernel (borrowed from past experience) that is both SMP and cluster optimized, and which has been tuned and modified for this particular application. The kernel supports an arbitrarily large address space, transparent thread/process migration over the network, full state recovery for both single-node and total system failures, load management, and a bunch of other features that are nice to have. Standard stuff for the most part.

At the application level, you have a multitude of "agents", between which is a complex network of associative and hierarchical relationships. Each agent is effectively a single domain (and quite a bit more, being active, sometimes goal-oriented components). Nothing too special in this aspect.

Under load, domains actively partition themselves into multiple agents/domains in real time to minimize theoretical prediction error rates as a function of resource usage and a few other things, while still maintaining informational integrity. This is probably the most critical info-theory-driven feature, and it saves a *lot* of clock cycles while delivering superior results. Queries against the entire system also require info-theory-derived algorithms and mathematics, but I'll skip over that topic here.

In the simplest case, you can start out with a single empty domain, start shoving information into it, and then let the network build itself. The resulting system, while containing everything fed to it, will by its nature extract all the context and information to fairly close to the theoretical limit, with some good weighting/noise reduction thrown in. It isn't so important here, but how the data organizes itself allows queries against the system to be computed quickly and with a low predictive error rate that is pretty close to the theoretical limits. Against a system that did not organize itself in something closely approximating this manner, queries with a similar predictive efficiency would become computationally intractable.

How the domains actually partition is driven to a certain extent by how the system is queried and used, as this can help define certain types of computational efficiency. The primary benefit of this kind of automatic and adaptive partitioning (other than keeping the domain spaces clean and computationally efficient) is that it allows the system to handle dirty and raucous data sources quickly and gracefully, without human intervention or serious system pollution. The adaptive partitioning scheme was originally a noise reduction algorithm I developed for data-mining purposes, but it has proven to be very versatile across a broad number of spaces (interestingly, it is algorithmically similar to high-end noise reduction algorithms in signal processing).
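The post does not spell out the partitioning algorithm, so the following is only a loose toy sketch of the general idea: a domain that splits itself once answering a query from it would cost more than a fixed budget. The `Domain` class, the use of item count as the cost proxy, and the median split are all assumptions made for illustration; the scheme described above is driven by prediction error and usage patterns, which this sketch does not model.

```python
class Domain:
    """Toy domain holding numbers; splits in two when a scan would exceed `budget` items."""

    def __init__(self, budget):
        self.budget = budget   # assumed cost proxy: max items scanned per query
        self.items = []        # observations stored in this leaf
        self.split = None      # (pivot, left, right) once partitioned

    def leaf_for(self, x):
        """Return the leaf domain responsible for value x."""
        node = self
        while node.split is not None:
            pivot, left, right = node.split
            node = left if x < pivot else right
        return node

    def insert(self, x):
        leaf = self.leaf_for(x)
        leaf.items.append(x)
        if len(leaf.items) > leaf.budget:
            leaf._partition()

    def _partition(self):
        ordered = sorted(self.items)
        pivot = ordered[len(ordered) // 2]
        low = [x for x in ordered if x < pivot]
        if not low:
            return  # median equals minimum (duplicates); splitting would not help
        left, right = Domain(self.budget), Domain(self.budget)
        left.items = low
        right.items = [x for x in ordered if x >= pivot]
        self.split = (pivot, left, right)
        self.items = []  # contents now live in the children; nothing is discarded

    def size(self):
        if self.split is None:
            return len(self.items)
        _, left, right = self.split
        return left.size() + right.size()


root = Domain(budget=16)
for value in range(1000):
    root.insert((value * 37) % 1000)  # 1000 distinct values in scrambled order
print(root.size(), len(root.leaf_for(500).items))
```

Starting from a single empty domain, the tree builds itself as data arrives. With distinct values every leaf stays within the budget, and the total item count is unchanged by splitting, a crude stand-in for partitioning without losing information.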

I glossed over a lot, but as you can see, info theory largely governs organizational behaviors to allow high predictive efficiency in a computationally reasonable manner and to weed out garbage. Other than that, it mostly looks like yet another giant network of motivated super-neurons, albeit optimized for silicon.

Cheers,

-James Rogers

jamesr@best.com

