Systems, methods, and computer-readable non-transitory storage medium in which a statistical
machine translation model for formatting medical reports is trained in a learning phase using bitexts and in a tuning phase using manually transcribed dictations. Bitexts are generated from automated
speech recognition dictations and corresponding formatted reports, using a series of steps including identifying matches and edits between the dictations and their corresponding reports using
dynamic programming, merging matches with adjacent edits, calculating a
confidence score, identifying acceptable matches, edits, and merged edits, grouping adjacent acceptable matches, edits, and merged edits, and generating a plurality of bitexts each having a predetermined maximum
word count (e.g., 100 words), preferably with a predetermined overlap (e.g., two thirds) with another bitext. During the tuning phase, the
system is trained by iteratively translating manually transcribed dictations and adjusting the relative model weights until best performance on error rate criteria (e.g., WER and CDER).