The most useful bootstrapping technique we have adopted is a transliteration-based back-off for out-of-vocabulary (OOV) words: non-native characters (graphemes) are mapped to a phonetic pivot script before entering the grapheme-to-phoneme (G2P) engine. Engineering teams typically attack the problem by growing lexicons, but named entities in code-switched regions churn too quickly for lexicon growth alone to keep up. Instead, a deterministic mapping creates a phonetic 'hint' for each OOV word that aligns with the phonotactics of the dominant language, and that hint is what the neural model sees.

We used this method in a production pipeline processing Hinglish (mixed Hindi and English) data. For example, the Hindi word सब्ज़ी ('vegetable', pronounced roughly 'sub-zee') is OOV to an English-dominant model. Rather than letting the model invent a pronunciation for raw Devanagari in an English-dominant context, the back-off mapped it to the Latin-script phonetic anchor 'sbjii'. This change in model design reduced our OOV pronunciation error rate by approximately 15%. It provided a more seamless bridge between the two scripts and produced intelligible phonetic output for TTS downstream, whether switching from English to Hindi or Hindi to English.

When designing for code-switching, it's much less about creating the perfect linguistic structure and more about allowing the model to degrade gracefully. In other words, when the model encounters something it's unsure of, the objective is to fail in a manner that is still phonetically useful to the end user rather than generating noise.
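A minimal sketch of the deterministic back-off described above. The character table, the function names, and the toy lexicon are all illustrative assumptions, not the production code; a real system would cover the full Devanagari block and hand the hint to a neural G2P rather than returning it directly.

```python
# Illustrative subset of a deterministic Devanagari-to-Latin anchor table.
# Combining marks are written as escapes for clarity.
DEVANAGARI_TO_LATIN = {
    "स": "s",
    "ब": "b",
    "ज": "j",
    "ी": "ii",
    "\u094d": "",  # virama: suppresses the inherent vowel, emits nothing
    "\u093c": "",  # nukta: pronunciation detail dropped in the coarse hint
}

def phonetic_hint(token: str) -> str:
    """Map each grapheme to its Latin anchor; pass unknown characters through."""
    return "".join(DEVANAGARI_TO_LATIN.get(ch, ch) for ch in token)

def g2p_with_backoff(token: str, lexicon: dict) -> str:
    """Lexicon lookup first; on an OOV hit, fall back to the phonetic hint."""
    if token in lexicon:
        return lexicon[token]
    # OOV: give downstream G2P a Latin-script hint instead of raw Devanagari.
    return phonetic_hint(token)
```

With this table, सब्ज़ी (स + ब + virama + ज + nukta + ी) maps to the anchor 'sbjii' from the example above.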
One grapheme-to-phoneme bootstrapping trick that worked well for me in a code-switched pipeline was transliteration back-off combined with mixed-script lexicon seeding. The core issue I kept running into was named entities that appeared in Latin script but were clearly borrowed from another language. The G2P model treated them as English and produced unusable pronunciations, which then cascaded into ASR and TTS errors.

The fix was to detect likely non-English named entities at runtime and transliterate them into their native script before phoneme generation. I then ran a language-specific G2P on the transliterated form and mapped the phonemes back into the shared phoneme inventory. This back-off only triggered when confidence was low, so it did not affect clean English names.

One concrete example was the name "Chandrapur" appearing in an otherwise English utterance. The baseline English G2P produced something like CH AE N D R AH P ER, which consistently caused recognition errors. After transliteration to cNdrpur and running a Hindi G2P, the phoneme sequence aligned much better with how speakers actually pronounced it. I seeded this pronunciation into a mixed-script lexicon so future occurrences skipped inference entirely.

The impact was meaningful. For code-switched utterances containing named entities, OOV pronunciation errors dropped by roughly 30 percent on our evaluation set. More importantly, downstream word error rate improved by about 8 percent in those segments, which was noticeable in real user logs. The lesson for me was that named entities deserve special treatment. Letting a single monolithic G2P guess across languages is convenient, but small, targeted bootstrapping tricks like transliteration back-off can deliver outsized gains with minimal system complexity.
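The runtime flow described here can be sketched as follows. Everything in this snippet is a toy stand-in: the transliteration table, the Hindi phoneme string, and the hardcoded confidence scores are placeholders for what would be real models and per-token posteriors in production.

```python
# Toy stand-ins for the real components (illustrative only).
LATIN_TO_WX = {"chandrapur": "cNdrpur"}            # transliteration table
HINDI_G2P = {"cNdrpur": "C AX N D R AX P UH R"}    # hypothetical Hindi phones
ENGLISH_G2P = {
    "chandrapur": "CH AE N D R AH P ER",           # the bad baseline guess
    "london": "L AH N D AH N",
}

def english_g2p(token: str):
    """Baseline English G2P plus a confidence score (toy: hardcoded)."""
    key = token.lower()
    phones = ENGLISH_G2P.get(key, "")
    # Real systems expose model posteriors; here, borrowed entities in the
    # transliteration table are simply marked low-confidence.
    confidence = 0.3 if key in LATIN_TO_WX else 0.9
    return phones, confidence

def g2p_with_translit_backoff(token: str, lexicon: dict, threshold=0.5) -> str:
    key = token.lower()
    if key in lexicon:                 # seeded pronunciations always win
        return lexicon[key]
    phones, conf = english_g2p(token)
    if conf >= threshold:              # clean English names are untouched
        return phones
    # Low confidence: transliterate, run the Hindi G2P, seed the lexicon
    # so future occurrences skip inference entirely.
    wx = LATIN_TO_WX.get(key, key)
    hindi_phones = HINDI_G2P.get(wx, phones)
    lexicon[key] = hindi_phones
    return hindi_phones
```

The key design choice is that the back-off is gated: high-confidence English tokens never enter the transliteration path, so the fix cannot regress clean names.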
One practical grapheme-to-phoneme bootstrapping trick that reduced OOV pronunciations for code-switched named entities was transliteration back-off with mixed-script lexicon seeding.

What was done
When the G2P model encountered an unseen named entity written partly in Latin and partly in a native script, the pipeline first normalized the token into a single script using a lightweight transliteration layer. That transliterated form was then checked against a seeded mixed-script pronunciation lexicon before falling back to neural G2P.

Example entry
Original token in production logs: Paytm_Karo
Steps applied:
- Normalize and transliterate: Paytm Karo - pe-ttiiem kro
- Seeded lexicon entries added: Paytm - p ey t iy m, Karo - k aa r ow
- Final assembled pronunciation: p ey t iy m k aa r ow

Impact on error rate
Before this back-off strategy, such mixed tokens were fully OOV and produced unstable phoneme sequences, often collapsing into generic vowel-heavy outputs. After introducing transliteration back-off plus lexicon seeding:
- Named-entity OOV rate dropped by ~30 percent
- Pronunciation error rate on code-switched utterances dropped by ~18 percent
- Downstream ASR WER on queries containing brand names improved measurably, especially in Hindi-English mixed traffic

Why it worked
The key was not replacing the G2P model, but constraining it. Transliteration provided a linguistically plausible spelling, and the mixed-script lexicon anchored high-frequency entities with stable pronunciations. Neural G2P was only used as a last resort, where it performs best. This approach scaled well in production because new entities could be added incrementally to the lexicon without retraining the full G2P model.