Memoizing typechecking

Consider a term of the form f ?u1 ?u2, and consider the scenario where the unification variables ?u1 and ?u2 are inferred to be the same term e. In the 1-phase typechecker, the term e would have been typechecked just once. However, in the 2-phase typechecker, the term is elaborated to f e e and the second phase ends up typechecking e term twice, leading to unnecessary duplication of work.

This note is about making the typechecker aware of such sharing and memoizing the typechecking results. The basic idea is to maintain a cache in the typechecker that maps physical addresses of the terms to their computed types, and consult it before typechecking the term.

Scoping issues

We need to be careful about maintaining well-scoped terms in the cache. But we also don't want to cache the whole type environment with each term. So we rely on the locally nameless representation of terms, that the typechecker already maintains, and keep types with the names in the term. That way, the terms can be typechecked independent of the type environment. In the typechecker AST terminology, we already maintain a sort field with every bv -- we just need to make sure that this is used and populated consistently.

Even after that, we can have a sanity check, where if a term e finds the cache hit, the free variables of e are in the environment.

Preserving sharing

The cache hit rate is going to be heavily dependent on the typechecker consciously maintaining sharing of terms. One place where we break it is substitution. For e[v/x], even if x is not free in e, we currently traverse and output a fresh e during the substitution and this breaks sharing.

For this, we can rely on the free variables information at every AST node, that the typechecker already maintains, and not traverse the term if the variable that we are substituting does not appear free in it. We need to benchmark if this is going to be a performance hit.

Rolling back the cache

We need to be careful about caching the results of typechecking paths that are rolled back or discarded during unification. We can maintain the cache along with the Union-Find graph that the unifier maintains and roll it back whenever the UF graph is.

Second-level cache

We can also implement a second-level cache that maps hash of the term e to the tuple (e, t), where t is its type. Once there is a hit in the cache for some term e', we would avoid the possibility of hash-collision by checking eq_term e e' before taking the type t from the cache entry. If it is indeed a successful hit, the entry (&e', t) will be promoted to the first-level cache.

To avoid hash computations again and again, we can memoize the hash of the term in the AST node.

Some initial attempts and discussions, 21st Jan. 2019

_memo_tc branch contains code for second level cache, which maintains a table of term hashes and their computed type. In case there is a hit in the table, it confirms that the terms are indeed same using eq_tm. For hashing, it is currently using the generic hash functions from OCaml and F#.

One issue with using physical addresses is that they can change with GC.

Another issue with using the caches in the first phase is that implicit resolution does not play nice. E.g.

module T

let id #a (x:a) = x

let ff (g : int -> int) (h : unit -> unit) = 42

let _ = ff id id  //each id is elaborated differently in the first phase, so just caching id is not good

So may be we should use the caches in second phase only?

Provide feedback

Saved searches