Document Triage

Document triage is the process of converting a set of digital files into well-defined text documents. It is one of two stages of text pre-processing.
The document triage process may involve one or more of the following steps, depending on the origin of the files being processed:
Character encoding identification – For a document to be machine readable, its characters must be represented in a character encoding, a scheme that maps characters to the binary data in which text is stored. Several such schemes exist, for example ASCII and the Unicode encodings UTF-8 and UTF-16. This step determines which character encoding a text file uses.
Language identification – A document may contain text in a single language or in several languages. This step identifies the language(s) used in the document.
Text sectioning – This step identifies the actual content within a file and discards undesirable elements such as images, tables, headers, links, and HTML formatting.
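
The three steps above can be illustrated with a minimal Python sketch that uses only the standard library. It is a toy under stated assumptions, not a production method: the function names (guess_encoding, guess_language, extract_text), the tiny stopword lists, the set of skipped HTML tags, and the Latin-1 fallback are all hypothetical choices made for this example. Real systems generally rely on statistical detectors trained on much larger samples.

```python
import codecs
from html.parser import HTMLParser

# Byte-order marks, checked longest first so that UTF-32 LE
# (FF FE 00 00) is not mistaken for UTF-16 LE (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(raw: bytes) -> str:
    """Step 1: guess a codec name from a BOM, then by trial decoding."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    for name in ("ascii", "utf-8"):
        try:
            raw.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    # Assumed fallback: Latin-1 can decode any byte sequence.
    return "latin-1"


# Deliberately tiny, illustrative stopword lists (an assumption of this sketch).
_STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in"},
    "fr": {"le", "la", "et", "est", "les", "des"},
    "de": {"der", "die", "und", "ist", "das", "nicht"},
}


def guess_language(text: str) -> str:
    """Step 2: pick the language whose stopwords occur most often."""
    words = text.lower().split()
    scores = {lang: sum(w in sw for w in words) for lang, sw in _STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


class _TextExtractor(HTMLParser):
    """Collects text outside of elements treated as non-content."""

    _SKIP = {"script", "style", "table", "header", "nav"}

    def __init__(self):
        super().__init__()
        self._depth = 0   # nesting depth inside skipped elements
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIP:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIP and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html: str) -> str:
    """Step 3: keep body text; drop markup, images, tables, and headers."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    raw = (
        "<html><head><style>p{}</style></head><body><header>Site</header>"
        "<p>The cat is in the house and the dog is out.</p>"
        "<table><tr><td>1</td></tr></table></body></html>"
    ).encode("utf-8")
    encoding = guess_encoding(raw)
    text = extract_text(raw.decode(encoding))
    print(encoding, guess_language(text), text)
```

The steps run in this order because each depends on the previous one: the bytes must be decoded before markup can be parsed, and language identification is more reliable on the extracted content than on raw markup.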


