Terrified Comments on Corrigibility in Claude’s Constitution
(Previously: Prologue.) Corrigibility as a term of art in AI alignment was coined as a word to refer to a property of an AI being willing to let its preferences be modified by its creator. Corrigibility in this sense was believed to be a desirable but unnatural property that would require more theoretical progress to specify, let alone implement. Desirable, because if you don’t think you specified your AI’s preferences correctly the first time, you want to be […]