Shakespeare Redux
As an example of a small natural-language corpus, consider the complete works of William Shakespeare as published by Project Gutenberg and freely available from MIT. The file contains roughly one-eighth of a million lines (about 125,000), just under one million words, and 5.46 megabytes of text.
The formatting is not entirely consistent across this corpus, but it is generally so: the text is largely just Shakespeare’s words, with 170 lines of preamble at the start, 400 lines of licensing notes at the end, and 220 copyright notices interspersed throughout.
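
The figures above (lines, words, bytes) can be checked with a few lines of Python. The following is only a minimal sketch: the function names `corpus_stats` and `strip_front_and_back` are hypothetical, words are assumed to be whitespace-separated tokens, and the preamble and licensing notes are assumed to sit at the very start and end of the file so that they can be trimmed by line count. It makes no attempt to remove the interspersed copyright notices, whose exact wording is not given here.

```python
def corpus_stats(text: str) -> dict:
    """Return line, word, and UTF-8 byte counts for a text.

    Words are counted as whitespace-separated tokens (an assumption;
    other tokenizations will give slightly different totals).
    """
    return {
        "lines": len(text.splitlines()),
        "words": len(text.split()),
        "bytes": len(text.encode("utf-8")),
    }


def strip_front_and_back(text: str, head: int = 170, tail: int = 400) -> str:
    """Drop the first `head` and last `tail` lines.

    The defaults mirror the preamble and licensing line counts quoted
    above; treating them as exact offsets is an assumption.
    """
    lines = text.splitlines(keepends=True)
    if head + tail >= len(lines):
        return ""  # nothing would remain after trimming
    return "".join(lines[head:len(lines) - tail])


# Tiny stand-in for the real file: one "preamble" line and one "license" line.
sample = "preamble\nTo be, or not to be\nthat is the question\nlicense\n"
body = strip_front_and_back(sample, head=1, tail=1)
stats = corpus_stats(body)
print(stats)
```

To apply this to the real corpus, read a local copy of the file into a string first and pass it through the same two functions with the default `head` and `tail` values.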