Autoencoder transformer
auto_circuit.model_utils.sparse_autoencoders.autoencoder_transformer
A transformer model that patches in sparse autoencoder reconstructions at each layer. Work in progress; error nodes are not yet implemented.
Classes
AutoencoderTransformer
AutoencoderTransformer(wrapped_model: Module, saes: List[SparseAutoencoder])
Bases: Module
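A minimal usage sketch, not taken from the library's own docs: the wrapper is normally obtained from sae_model (documented below), and the forward call is assumed to accept the same token inputs as the wrapped HookedTransformer. The "resid_delta_mlp" input site and the latent width are illustrative assumptions.

```python
import torch
import transformer_lens as tl

from auto_circuit.model_utils.sparse_autoencoders.autoencoder_transformer import (
    sae_model,
)

# Wrap a small model with randomly initialised autoencoders (assumed options).
base = tl.HookedTransformer.from_pretrained("pythia-70m-deduped")
sae_tf = sae_model(
    base,
    sae_input="resid_delta_mlp",  # assumed AutoencoderInput value
    load_pretrained=False,
    n_latents=4096,               # latent width for the randomly initialised SAEs
)

# Forward pass: layer activations are replaced by SAE reconstructions.
tokens = base.to_tokens("The quick brown fox")
with torch.no_grad():
    logits = sae_tf(tokens)
print(logits.shape)
```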
Functions
factorized_dest_nodes
factorized_dest_nodes(model: AutoencoderTransformer, separate_qkv: bool) -> Set[DestNode]
Get the destination part of each edge in the factorized graph, grouped by layer. The graph is factorized following A Mathematical Framework for Transformer Circuits (Elhage et al., 2021).
factorized_src_nodes
factorized_src_nodes(model: AutoencoderTransformer) -> Set[SrcNode]
Get the source part of each edge in the factorized graph, grouped by layer. The graph is factorized following A Mathematical Framework for Transformer Circuits (Elhage et al., 2021).
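A hedged sketch covering both factorized_src_nodes and factorized_dest_nodes, continuing from the sketch above (sae_tf is the AutoencoderTransformer built there); separate_qkv=True is assumed to give each attention head separate Q, K and V destination nodes.

```python
from auto_circuit.model_utils.sparse_autoencoders.autoencoder_transformer import (
    factorized_dest_nodes,
    factorized_src_nodes,
)

# Endpoints of the factorized graph's edges, grouped by layer.
srcs = factorized_src_nodes(sae_tf)
dests = factorized_dest_nodes(sae_tf, separate_qkv=True)
print(f"{len(srcs)} source nodes, {len(dests)} destination nodes")
```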
sae_model
sae_model(model: HookedTransformer, sae_input: AutoencoderInput, load_pretrained: bool, n_latents: Optional[int] = None, pythia_size: Optional[str] = None, new_instance: bool = True) -> AutoencoderTransformer
Inject SparseAutoencoder wrappers into a transformer model.
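A hedged sketch of the load_pretrained=True path, continuing from the sketch above; the pythia_size string below is a hypothetical placeholder, and the accepted values should be checked against the library.

```python
# Load published Pythia autoencoders instead of randomly initialised ones.
sae_tf_pretrained = sae_model(
    base,
    sae_input="resid_delta_mlp",  # assumed AutoencoderInput value
    load_pretrained=True,
    pythia_size="2_32768",        # hypothetical size identifier; check valid options
    new_instance=True,            # assumed to wrap a copy, leaving `base` untouched
)
```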