When I worked on semantic shift with historical newspapers, the single step that mattered most was doing OCR post-correction before embedding alignment, not after. Early on, I assumed embeddings would smooth over OCR noise. They did not. Instead, the models happily learned the noise, and alignment then preserved it across decades.

The correction step that helped most was lexicon-guided normalization focused on high-frequency content words, not everything. We built a short list of historically stable terms from clean later corpora and forced early OCR variants to map back to those forms when confidence was high. For example, long-s errors, broken ligatures, and spacing issues were fixed only when context strongly supported the correction. This avoided over-correcting genuinely archaic spellings.

Only after that did we do embedding alignment, using an orthogonal Procrustes approach with anchor words that were manually vetted for semantic stability across time. The key was that the anchors were drawn from the corrected text, not the raw OCR output. That reduced drift caused by garbage tokens.

One term where this flipped my interpretation was "strike." Before correction, the model suggested an early politicization in the late nineteenth century. After post-correction, that signal weakened significantly. What had looked like a semantic shift toward labor action was actually driven by OCR errors conflating "strike" with fragmented forms of "struck" and "striking" in sports and weather reporting. Once corrected, the labor sense still emerged, but closer to the early twentieth century, which aligned better with historical context.

That experience made me cautious. If semantic shift results look dramatic too early, I now assume noise first. Careful OCR correction plus conservative alignment turned flashy but wrong stories into slower, more believable ones.
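The lexicon-guided normalization idea can be sketched roughly as follows. The lexicon entries, repair rules, and function names here are illustrative stand-ins, not our actual pipeline; the point is the conservative shape of the check, where a variant is rewritten only when a candidate repair lands in the stable-word list:

```python
# Hypothetical lexicon of historically stable content words built from
# cleaner later corpora (real lists would be much larger and curated).
STABLE_LEXICON = {"strike", "struck", "striking", "labor", "weather"}

def candidate_repairs(token: str) -> set:
    """Generate plausible repairs for common OCR errors."""
    cands = set()
    # long-s errors: 'ſ' (or its frequent misread as 'f') in place of 's'
    cands.add(token.replace("\u017f", "s"))
    for i, ch in enumerate(token):
        if ch == "f":
            # try each 'f' -> 's' substitution individually
            cands.add(token[:i] + "s" + token[i + 1:])
    # broken ligatures, e.g. 'fi'/'fl' glyphs left as single codepoints
    cands.add(token.replace("\ufb01", "fi").replace("\ufb02", "fl"))
    cands.discard(token)
    return cands

def normalize(token: str) -> str:
    """Map an OCR variant back to a stable form only when a repair lands
    in the lexicon; otherwise leave it alone, so genuinely archaic
    spellings are not over-corrected."""
    if token in STABLE_LEXICON:
        return token
    for cand in candidate_repairs(token):
        if cand in STABLE_LEXICON:
            return cand
    return token
```

A real system would also gate on context (surrounding words, n-gram frequency) before committing a repair; membership in the lexicon is a crude stand-in for that confidence check.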
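For the alignment step, orthogonal Procrustes has a closed-form solution via SVD: the rotation minimizing the Frobenius distance between the two anchor matrices. Below is a minimal sketch on synthetic data; the anchor count, dimensionality, and matrices are arbitrary stand-ins, not the newspaper embeddings themselves:

```python
import numpy as np

def procrustes_align(src: np.ndarray, tgt: np.ndarray) -> np.ndarray:
    """Return the orthogonal W minimizing ||src @ W - tgt||_F, where each
    row of src/tgt is the vector of one vetted anchor word in one period."""
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt

# Toy demonstration: recover a known rotation between two "periods".
rng = np.random.default_rng(0)
anchors = rng.normal(size=(50, 20))      # anchor vectors, early period
theta = np.pi / 5                        # plant a rotation in dims 0 and 1
rot = np.eye(20)
rot[0, 0] = rot[1, 1] = np.cos(theta)
rot[0, 1], rot[1, 0] = -np.sin(theta), np.sin(theta)
later = anchors @ rot                    # same anchors, later period
W = procrustes_align(anchors, later)
err = np.linalg.norm(anchors @ W - later)
```

Because W is constrained to be orthogonal, the map can only rotate or reflect the space, never stretch it; that conservatism is what keeps garbage tokens in one period from warping the geometry of the other.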