Quinn

Quinn's Shortform

I should be more careful not to imply that we have solid specimens of computational reward functions; rather, I think they're a theoretically important region of the space of possible minds, and might factor into idealizations of agency.

# Computational vs axiomatic reward functions

## Computational vs axiomatic in proof engineering

## AIXI-like and AIXI-unlike AGIs

### Conjecture: humans are AIXI-unlike AGIs

## Relation to mutability???

## Position of this post in my overall reasoning

I've had a background assumption in my interpretation of and beliefs about reward functions for as long as I can remember (i.e. since first reading the Sequences), which I suddenly realized I don't believe is written down. Over the last two years I've gained enough experience writing Coq to inspire a convenient way of framing it.

A proof engineer calls a proposition *computational* if its proof can be broken down into parts.

For example, `a + (b + c) = (a + b) + c` is *computational* because you can think of its proof as the application of the associativity lemma followed by the application of something called a "refl", the fundamental termination of a proof involving equality. Passing around the associativity lemma is in a sense passing around *its* proof, which, assuming `a` is inductive (take `nat`: zero and successor), is an application of `nat`'s induction principle, unpacking the recursive definition of `+`, etc.

In other words, if my adversary asks "why is `a + (b + c) = (a + b) + c`?", I can show them; I only have to make sure they agree to the fundamental definitions of `nat` and `+ : nat -> nat -> nat`, and the rest I can *compel* them to believe.
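A minimal sketch of this in Coq (names here are my own, not from any particular development): associativity is proved by `nat`'s induction principle plus unfolding `+`, terminating in `reflexivity`, and `Print Assumptions` confirms nothing axiomatic was smuggled in.

```coq
(* Associativity of addition, proved computationally by induction
   on the first argument; no axioms are used. *)
Require Import Arith.

Lemma add_assoc_sketch : forall a b c : nat, a + (b + c) = (a + b) + c.
Proof.
  intros a b c.
  induction a as [| a' IH].
  - (* 0 + (b + c) reduces to b + c by the definition of + *)
    reflexivity.
  - (* S a' case: unfold +, then use the induction hypothesis *)
    simpl. rewrite IH. reflexivity.
Qed.

(* Reports "Closed under the global context": no axioms involved. *)
Print Assumptions add_assoc_sketch.
```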

On the flip side, consider **function extensionality**, or `f = g <-> forall x, f x = g x`, which is not provable because, to name but one scenario, we do not know that the domain of `f` (which equals the domain of `g`) is countable. Because they can't prove it, theories "admit function extensionality as an axiom" from time to time.

In other words, if I invoke function extensionality in a proof, and my adversary has agreed to the basic type and function definitions, they remain entitled to reject my proof, because if they ask why I believe function extensionality the best I can do is say "because I declared it on line 7".

We do not call reasoning involving axioms *computational*. Instead, the discourse has in a sense become *poisoned* by the axiom; its verificational properties have become weaker. (Intuitively, I can declare anything I want on line 7; the risk of proving something that is actually false increases a great deal with each axiom I declare.)
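A sketch of the poisoning in Coq (the axiom and lemma names are mine, for illustration): once `funext` is declared, any proof invoking it carries the axiom in its assumptions, and `Print Assumptions` exposes it to the adversary.

```coq
(* Admitting function extensionality as an axiom "poisons" every
   proof downstream of it -- the proof term is no longer closed. *)
Require Import Arith.

Axiom funext : forall (A B : Type) (f g : A -> B),
  (forall x, f x = g x) -> f = g.

Lemma eta_plus_zero : (fun n : nat => n + 0) = (fun n : nat => n).
Proof.
  apply funext. intro x. cbn. apply Nat.add_0_r.
Qed.

(* Reports "Axioms: funext : ..." rather than
   "Closed under the global context". *)
Print Assumptions eta_plus_zero.
```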

Apocryphally, a lecturer recalled a meeting, perhaps of the univalent foundations group at IAS, when homotopy type theory (HoTT) was brand new. (HoTT is based on something called univalence, which is about reasoning on type equalities in arbitrary "universes" ("kinds", for the Haskell programmer).) In HoTT 1.0, univalence relied on an axiom (done carefully, of course, to minimize the damage of the poison), and Per Martin-Löf is said to have remarked "it's not really type theory if there's an axiom". HoTT 2.0, called *cubical type theory*, repairs this, which is why cubical TT is sometimes called *computational* TT.

If the space of AGIs can be carved into AIXI-like and AIXI-unlike with respect to goals, clearly AIXI-like architectures have goals imposed on them axiomatically by the programmer. The complement, of course, is where the reward function is computational: decomposable.

See the NARS literature of Wang et al. for something at least adjacent to AIXI-unlike: reasoning about NARS emphasizes that reward functions can be computational *to an extent*, but eventually "bottom out" at atoms. Still, NARS goals are computational *to a far greater degree* than those of AIXI-likes.

This should be trivial: humans can decompose their reward functions in ways richer than "because god said so".

If the space of AGIs can be carved into AIXI-like and human-like with respect to goals, does the computationality question help me reason about modifying my own reward function? Intuitively, AIXI's axiomatic goal corresponds to immutability. However, I don't think there's a free implication that AIXI-unlikes get self-modification for free. More work is needed.

In general, it is deeply embedded in my assumptions that the AGI space can be divided into what I've called AIXI-like and AIXI-unlike with respect to how reward functions are reasoned about, and that computationality (anaxiomaticity vs. axiomaticity?) is the crucial axis along which to view it. Maybe writing it down will make eventually changing my mind easier: I'm uncertain just how conventional my belief/understanding is here.

App and book recommendations for people who want to be happier and more productive

Great list! I had a huge ergonomic gain myself in the past few weeks:

Discussion with Eliezer Yudkowsky on AGI interventions

I'm uncertain about what abstractions/models are best to use for reasoning about AI. Coq has a number of options, which might imply different things about provability of different propositions and economic viability of those proofs getting anywhere near prod. Ought a type theorist or proof engineer concern themselves with IEEE754 edge cases?

In software engineering, what are the upper limits of Language-Based Security?

it looks less like "move everyone to Haskell and Coq" and more like "design a good core-library crypto API for X" and "reform common practices around npm".

+1, coq is my dayjob and I'm constantly telling people to be less messianic when they talk about formal verification.

by designing the APIs so that doing things the safe way is easy, and doing things the dangerous way requires calling functions with "dangerous" in the name

+1; I want to extend it by paraphrasing the ancient folk wisdom that I think comes in flavors for both libraries and languages: "*things are bad when good behavior is impossible, things are good when bad behavior is impossible, everything comes in a spectrum of easier or harder bad or good behavior*", which can be read in terms of rewards and punishments (easy code is rewarding, hard code is punishing).

Study Guide

Could you say more about the value proposition of chemistry?

The value prop of physics, as I understand it (and I'm pretty behind on physics myself), is:

- classical mechanics is the bare-bones proof of concept for working with "the scientific method"
- included in the above is a bare-bones instance of predicting and modeling, which gives you firm ground for your feet when you're predicting and modeling things that aren't so straightforward
- if you're a logic freak, it trains you to be comfortable with the insufficiency of symbols; the memes and jokes about mathematicians and physicists disagreeing over notation reveal an underlying controversy about the question "what are symbols actually for?", and sophisticated thinking about this question is a major asset
- if you're a logic freak, it gets you out of your comfort zone
- practice with calculus/geometry

Study Guide

I agree that tooling based on Curry-Howard correspondences isn't great. We live in a somewhat primitive time. Maybe you're interested in my brief speculation about formal verification as a sort of dark horse in alignment, where I look at FV and the *developer ecosystem* question, which lags behind the *state of the art theory* question.

And you mentioned proof, and how it's disappointing that analysis is the standard fodder for proofs, which I roughly agree with and roughly disagree with (analysis is nice because you can review and strengthen calculus while leveling up in logic. Parallelization!). I'd like to nominate Logical Foundations. It gave me a very thorough foundation after a couple of years of community college; it's the kind of learning that you feel in your bones if you do it interactively. Executable textbooks in general are, I think, more bang for the buck than classical textbooks.

**Meta edit**: hyperlink isn't working, softwarefoundations.cis.upenn.edu/ maybe this is better.

Study Guide

Love this! You're really mild on programming language theory and functional programming. Any comments on the omission?

# capabilities-prone research

# What are some alternatives to the splitting-a-dollar intuition?

I come to you with a dollar I want to spend on AI. You can allocate `p` pennies to go to capabilities and `100-p` pennies to go to alignment, but only if you know of a project that realizes that allocation. For example, we might think that GAN research sets `p = 98` (providing 2 cents to alignment) while interpretability research sets `p = 10` (providing 90 cents to alignment).

Is this remotely useful? This is a really rough model (you might think it's more of a venn diagram and that this model doesn't provide a way of reasoning about the double-counting problem).

**a task**: rate research areas, even whole agendas, with such a value `p`. Many people may disagree about my example assignments to GANs and interpretability, or think both of those are too broad.

To say something is capabilities-prone is less to say a dollar has been cleanly split, and more to say that there are some dynamics that tend, or get pushed, in different directions. Perhaps I want some sort of fluid metaphor instead.

## Positive and negative longtermism

I'm not aware of a literature or a dialogue on what I think is a very crucial divide in longtermism.

In this shortform, I'm going to take a polarity approach: I'm going to push each pole to its extreme, probably beyond positions that are actually held, because I think median longtermism, or the longtermism described in The Precipice, is a kind of average of the two.

Negative longtermism is saying "let's not let some bad stuff happen", namely extinction. It wants to preserve. If nothing gets better for the poor or the animals or the astronauts, but we dodge extinction and revolution-erasing subextinction events, that's a win for negative longtermism.

In positive longtermism, such a scenario is considered a loss. From an opportunity cost perspective, the failure to erase suffering or to bring agency and prosperity to `1e1000` comets and planets hurts literally as bad as extinction.

Negative longtermism is a vision of what *shouldn't* happen. Positive longtermism is a vision of what *should* happen.

My model of Ord says we should lean at least 75% toward positive longtermism, but I don't think he's an extremist. I'm uncertain if my model of Ord would even subscribe to the formation of this positive and negative axis.

What does this axis mean? I wrote a little about this earlier this year. I think figuring out what projects you're working on and who you're teaming up with depends strongly on how you feel about negative vs. positive longtermism. The two dispositions toward myopic coalitions are "do" and "don't". I won't attempt to claim which disposition is more rational or desirable, but I'll explore each branch.

When Alice wants future `X` and Bob wants future `Y`, but if they don't defeat the adversary Adam they will be stuck with future `0` (containing great disvalue), Alice and Bob may set aside their differences and choose to form a *myopic coalition* to defeat Adam, or not.

**Form myopic coalitions.** A trivial case where you would expect Alice and Bob to tend toward this disposition is if `X` and `Y` are similar. However, if `X` and `Y` are very different, Alice and Bob must each believe that defeating Adam *completely hinges* on their teamwork in order to tend toward this disposition, unless they're in a high-trust situation where each can credibly signal that they won't try to get a head start on the `X` vs. `Y` battle until `0` is completely ruled out.

**Don't form myopic coalitions.** A low-trust environment where Alice and Bob each fully expect the other to try to get a head start on `X` vs. `Y` during the fight against `0` would tend toward the disposition of not forming myopic coalitions. This could lead to great disvalue if a project against Adam can only work via a team of Alice and Bob.

An example of such a low-trust environment is, if you'll excuse political compass jargon, bottom-lefts online debating internally the merits of working with top-lefts on projects against capitalism. The argument for coalition is that capitalism is a formidable foe and they could use as much teamwork as possible; the argument against coalition is the history of backstabbing and pogroms when top-lefts take power and betray the bottom-lefts.

For a silly example, consider an insurrection against broccoli. The ice cream faction can coalition with the pizzatarians if they do some sort of value trade that builds trust, like the ice cream faction eating some pizza and the pizzatarians eating some ice cream. Indeed, the viciousness of the fight after broccoli is abolished may have nothing to do with the solidarity between the two groups under broccoli's rule. It may or may not be the case that the ice cream faction and the pizzatarians can come to an agreement about how best to increase value in a post-broccoli world. Civil war may follow revolution, or not.

Now, while I don't support the long reflection (TLDR: I think a collapse of diversity sufficient to permit a long reflection would be a tremendous failure), I think elements of positive longtermism are crucial for things to improve for the poor or the animals or the astronauts. I think positive longtermism could outperform negative longtermism when it comes to finding synergies between the extinction-prevention community and the suffering-focused ethics community. However, I would be very upset if I turned around in a couple of years and positive longtermists were, like, the premiere face of longtermism. The reason is that once you admit positive goals, you have to deal with everybody's political aesthetics: a philosophy professor's preference for a long reflection, an engineer's preference for moar spaaaace, a conservative's preference for retvrn to pastorality, a liberal's preference for intercultural averaging. A negative goal like "don't kill literally everyone" largely lacks this problem. Yes, I would change my mind about this if 20% of global defense expenditure were targeted at defending against extinction-level or revolution-erasing events; then the neglectedness calculus would lead us to focus the comparatively smaller EA community on positive longtermism.

The takeaway from this shortform should be that quinn thinks negative longtermism is better for forming projects and teams.