Sparkly: A Simple yet Surprisingly Strong TF/IDF Blocker for Entity Matching
Derek Paulsen
"Blocking is a major task in entity matching. Numerous blocking
solutions have been developed, but as far as we can tell, blocking
using the well-known tf/idf measure has received virtually no
attention. Yet, when we experimented with tf/idf blocking using Lucene,
we found it did quite well. So in this paper we examine tf/idf
blocking in depth. We develop Sparkly, which uses Lucene to perform
top-k tf/idf blocking in a distributed share-nothing fashion on a
Spark cluster. We develop techniques to identify good attributes
and tokenizers that can be used to block on, making Sparkly
completely automatic. We perform extensive experiments showing that
Sparkly outperforms 8 state-of-the-art blockers. Finally, we
provide an in-depth analysis of Sparkly’s performance, regarding both
recall/output size and runtime. Our findings suggest that (a) tf/idf
blocking needs more attention, (b) Sparkly forms a strong baseline
that future blocking work should compare against, and (c) future
blocking work should seriously consider top-k blocking, which
helps improve recall, and a distributed share-nothing architecture,
which helps improve scalability, predictability, and extensibility."
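
To make the idea of top-k tf/idf blocking concrete, the following is a minimal single-machine sketch in plain Python. It is not Sparkly's implementation and does not use Lucene or Spark. The whitespace tokenizer, the smoothed idf formula, the scoring function, and all names (`build_index`, `top_k`, `block`) are assumptions chosen for illustration only. The sketch indexes the records of one table, then for each record of the other table keeps the k highest-scoring indexed records as candidate matches.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumed tokenizer: lowercase, split on whitespace.
    return text.lower().split()


def build_index(table_a):
    """Build an inverted index over table A.

    Returns (postings, idf), where postings maps
    token -> list of (record id, term frequency).
    """
    postings = defaultdict(list)
    for rid, text in table_a.items():
        for token, tf in Counter(tokenize(text)).items():
            postings[token].append((rid, tf))
    n = len(table_a)
    # Assumed smoothed idf; real search engines use other variants.
    idf = {t: math.log(1 + n / len(p)) for t, p in postings.items()}
    return postings, idf


def top_k(query_text, postings, idf, k):
    """Return the ids of the k indexed records with the highest tf/idf score."""
    scores = defaultdict(float)
    for token, qtf in Counter(tokenize(query_text)).items():
        for rid, tf in postings.get(token, ()):
            # Dot product of the two tf*idf vectors, restricted to shared tokens.
            scores[rid] += (qtf * idf[token]) * (tf * idf[token])
    # Sort by descending score; break ties by record id for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [rid for rid, _ in ranked[:k]]


def block(table_a, table_b, k):
    """Emit candidate pairs (b id, a id): the top-k A records per B record."""
    postings, idf = build_index(table_a)
    pairs = []
    for bid, text in table_b.items():
        for aid in top_k(text, postings, idf, k):
            pairs.append((bid, aid))
    return pairs


if __name__ == "__main__":
    # Hypothetical toy tables, made up for this sketch.
    table_a = {
        "a1": "apple iphone 12 black",
        "a2": "samsung galaxy s21",
        "a3": "apple macbook pro 13",
    }
    table_b = {"b1": "iphone 12 apple", "b2": "galaxy s21 samsung"}
    print(block(table_a, table_b, k=2))
```

In this sketch each record of the second table is scored independently against a read-only index, so the loop in `block` could in principle be partitioned across workers that share no state. The per-query cap of k also bounds the output at k candidate pairs per record of the second table.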