[Submitted on 31 Jan 2023]

Download PDF

Abstract: The widespread use of spreadsheet environments by billions of users presents
a unique opportunity for formula-authoring assistance. Although large language
models, such as Codex, can assist in general-purpose languages, they are
expensive to train and challenging to deploy due to their large model sizes (up
to billions of parameters). Moreover, they require hundreds of gigabytes of
training data. We present FLAME, a T5-based model trained on Excel formulas
that leverages domain insights to achieve competitive performance with a
substantially smaller model (60M parameters) and two orders of magnitude less
training data. We curate a training dataset using sketch deduplication,
introduce an Excel-specific formula tokenizer for our model, and use
domain-specific versions of masked span prediction and noisy auto-encoding as
pretraining objectives. We evaluate FLAME on formula repair, formula
auto-completion, and a novel task called syntax reconstruction. FLAME (60M) can
outperform much larger models, such as Codex-Davinci (175B), Codex-Cushman
(12B), and CodeT5 (220M), in 6 out of 10 settings.

Submission history

From: Harshit Joshi [view email]


[v1]
Tue, 31 Jan 2023 17:29:43 UTC (230 KB)

Read More