English | October 23, 2018 | ASIN: B07JMXDFLW | 420 pages | AZW3 | 1.45 MB
The World Wide Web is currently the largest source of text, encoded in a huge variety of languages. A practical and sound way of exploiting this data for linguistic research is to compile a static corpus for a given language. This approach has several advantages: (i) working with such corpora avoids the problems encountered when using Google for quantitative linguistic research (such as non-transparent ranking algorithms); (ii) creating a corpus from web data is virtually free; (iii) the size of corpora compiled from the WWW may exceed by several orders of magnitude the size of language resources offered elsewhere; (iv) the data is locally available to the user, and it can be linguistically post-processed and queried with the tools the user prefers. This book describes the main practical tasks in building web corpora up to giga-token size. Among these tasks are the sampling process (i.e., web crawling) and the usual cleanup steps, such as boilerplate removal and removal of duplicated content. Linguistic processing, and the problems it faces due to the different kinds of noise in web corpora, is also covered. Finally, the authors show how web corpora can be evaluated and compared to other corpora (such as traditionally compiled corpora).