One trick that genuinely held up for us in low-resource tonal ASR was adding syllable-aligned F0 contour embeddings with an explicit tone-bigram auxiliary loss. Instead of treating tone as a static class label per syllable, we extracted normalized F0 contours over each detected syllable nucleus. Each syllable was represented by a short vector capturing slope, curvature, and relative pitch range after speaker-level normalization, and that vector was concatenated to the acoustic encoder output. On top of that, we trained a small auxiliary classifier to predict tone bigrams across adjacent syllables. The bigram objective forced the model to learn common tone sandhi transitions rather than memorizing isolated tones.

The incident that convinced us came from spontaneous conversational Mandarin, where the classic third-tone sandhi case kept failing in fast speech. For example, in the utterance "ni3 xiang3 mai3 shen2me," the first "ni3" is realized with a rising contour due to sandhi. Our baseline model often misrecognized "ni3 xiang3" as "ni2 xiang3" or dropped tone distinctions entirely, leading to lexical errors.

After introducing syllable-level contour embeddings plus the tone-bigram loss, overall WER dropped from 21.4 percent to 18.7 percent on our conversational test set. More importantly, the tone error rate in third-tone sandhi contexts dropped by roughly 30 percent relative. The model stopped over-penalizing sandhi realizations because it had learned that certain tone transitions are expected in sequence.

The key insight was modeling tone as a contextual phenomenon: in real conversational audio, tone lives across syllables, not inside them.
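For concreteness, here is a minimal sketch of the per-syllable contour featurization, assuming F0 has already been tracked and syllable nuclei detected upstream. The function name, the quadratic-fit parameterization, and the extra offset term are illustrative choices, not an exact reproduction of our pipeline:

```python
# Illustrative sketch: per-syllable F0 contour -> short shape vector.
# Assumes voiced frames only (f0 > 0) and at least 3 frames per nucleus.
import numpy as np

def syllable_contour_vector(f0_hz, speaker_mean, speaker_std):
    """Map raw F0 samples over one syllable nucleus to a small vector
    (slope, curvature, relative pitch range, offset) after speaker-level
    normalization in log-F0 space."""
    logf0 = np.log(np.asarray(f0_hz, dtype=np.float64))
    z = (logf0 - speaker_mean) / speaker_std   # speaker-level z-normalization
    t = np.linspace(-1.0, 1.0, len(z))         # normalized time axis
    # Quadratic fit returns coefficients highest order first, giving
    # curvature, slope, and offset of the normalized contour.
    curvature, slope, offset = np.polyfit(t, z, deg=2)
    pitch_range = z.max() - z.min()            # relative range after norm
    return np.array([slope, curvature, pitch_range, offset])
```

A low-order polynomial fit like this is deliberately coarse: a handful of shape parameters is enough to separate rising, falling, and dipping contours while staying robust to noisy F0 tracks in spontaneous speech.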
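And a rough PyTorch sketch of how the two pieces plug into training. The dimensions, vocabulary size, five-tone inventory, loss weight, and especially the assumption of syllable-aligned (pooled) encoder states are all simplifications for illustration:

```python
# Illustrative sketch: fuse contour vectors with encoder states and add a
# tone-bigram auxiliary loss. All sizes and the 0.3 weight are assumptions.
import torch
import torch.nn as nn

NUM_TONES = 5                    # Mandarin tones 1-4 plus neutral tone
BIGRAM_CLASSES = NUM_TONES ** 2  # one class per (tone_i, tone_{i+1}) pair

class ToneAwareHead(nn.Module):
    def __init__(self, enc_dim=512, contour_dim=4, vocab=4000, aux_weight=0.3):
        super().__init__()
        self.aux_weight = aux_weight
        fused = enc_dim + contour_dim
        self.asr_head = nn.Linear(fused, vocab)          # main ASR logits
        # The bigram head sees two adjacent fused syllable states at once.
        self.bigram_head = nn.Linear(2 * fused, BIGRAM_CLASSES)

    def forward(self, enc, contours, targets, tone_seq):
        # enc: (B, S, enc_dim) syllable-aligned encoder states
        # contours: (B, S, contour_dim) per-syllable F0 shape vectors
        # targets: (B, S) token labels; tone_seq: (B, S) tones in 0..4
        fused = torch.cat([enc, contours], dim=-1)
        asr_loss = nn.functional.cross_entropy(
            self.asr_head(fused).transpose(1, 2), targets)
        # Pair each syllable with its right neighbor for the bigram target.
        pairs = torch.cat([fused[:, :-1], fused[:, 1:]], dim=-1)
        bigram_target = tone_seq[:, :-1] * NUM_TONES + tone_seq[:, 1:]
        aux_loss = nn.functional.cross_entropy(
            self.bigram_head(pairs).transpose(1, 2), bigram_target)
        return asr_loss + self.aux_weight * aux_loss
```

The design point is that the bigram head is supervised with canonical tone pairs, so a frequent transition like 3-3 becomes its own class: the model can learn that its surface realization looks like 2-3 instead of penalizing the rising contour as a tone error.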