Editing Predictions by Modeling Model Computation
In our last post, we introduced a task, component modeling, for understanding how individual components contribute to a model’s output. The goal there was to predict how a given model prediction would respond to “component ablations,” that is, targeted modifications to specific parameters. We focused on a special “linear” case called component attribution, where we (linearly) decompose a model prediction into contributions from every model component, as shown below: We then presented a method, called COAR (Component Attribution via […]
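To make the linear decomposition described above concrete, here is a minimal sketch of one way to estimate per-component contributions: ablate random subsets of components, record the resulting output, and fit a linear model to those observations. This is an illustration under stated assumptions, not the authors' implementation. The names `estimate_attributions` and `toy_output`, the 0/1 mask encoding, and the exactly linear toy "model" are all hypothetical.

```python
import numpy as np

def estimate_attributions(output_fn, n_components, n_samples=200, keep_prob=0.5, seed=0):
    """Fit output ~ bias + sum_i theta[i] * mask[i] by least squares.

    mask[i] = 1 keeps component i, mask[i] = 0 ablates it (assumed encoding).
    """
    rng = np.random.default_rng(seed)
    # Sample random ablation masks, one row per sampled ablation.
    masks = (rng.random((n_samples, n_components)) < keep_prob).astype(float)
    # Query the (stand-in) model once per mask.
    outputs = np.array([output_fn(m) for m in masks])
    # Append a column of ones so the fit includes an intercept term.
    design = np.hstack([masks, np.ones((n_samples, 1))])
    coef, *_ = np.linalg.lstsq(design, outputs, rcond=None)
    return coef[:-1], coef[-1]

# Hypothetical toy "model": its output is exactly linear in the kept components.
true_contrib = np.array([2.0, -1.0, 0.5, 0.0, 3.0])

def toy_output(mask):
    return 0.25 + float(true_contrib @ mask)

theta, bias = estimate_attributions(toy_output, n_components=5)
```

Because the toy output is exactly linear, the fitted `theta` matches `true_contrib` up to floating-point error. A real network's response to ablations is only approximately linear, so the fitted coefficients would be an approximation whose quality has to be checked on held-out ablations.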