We can pull strings and lists apart by indexing and slicing them, and we can join them together by concatenating them. However, we cannot join a string to a list. If we use a for loop to process the elements of a string, all we can pick out are the individual characters; we don't get to choose the granularity.
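The operations just described can be sketched in Python 3 as follows. The variable names and sample values (`query`, `beatles`) are illustrative assumptions, not taken from the text above.

```python
# Illustrative sample data (assumed for this sketch).
query = 'Who knows?'
beatles = ['John', 'Paul', 'George', 'Ringo']

# Indexing and slicing work the same way on strings and lists.
first_char = query[0]        # a one-character string
tail = query[4:]             # a substring
first_two = beatles[:2]      # a sub-list

# Concatenation joins two sequences of the same type.
longer_query = query + ' I do.'
more_beatles = beatles + ['Brian']

# Joining a string to a list is a type error.
try:
    query + beatles
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Iterating over a string always yields single characters.
chars = [ch for ch in query]
```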

2 Strings: Text Processing at the Lowest Level

By contrast, the elements of a list can be as big or small as we like. Lists therefore have the advantage that we can be flexible about the elements they contain, and correspondingly flexible about any downstream processing.

Consequently, one of the first things we are likely to do in a piece of NLP code is tokenize a string into a list of strings. Conversely, when we want to write our results to a file, or to a terminal, we will usually format them as a string. Lists and strings do not have exactly the same functionality.
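A minimal sketch of this round trip, assuming whitespace splitting as a stand-in for a real tokenizer (the sample sentence is also an assumption):

```python
raw = 'Red lorry, yellow lorry, red lorry, yellow lorry.'

# str.split() is a crude whitespace tokenizer: punctuation stays
# attached to the neighbouring word, which a real tokenizer would avoid.
tokens = raw.split()

# Each list element is now a whole token, so downstream processing
# works at word granularity rather than character granularity.
token_lengths = [len(t) for t in tokens]

# To write results out, format the list back into a single string.
line = ' '.join(tokens[:3])
print(line)
```

Splitting on whitespace is only the simplest possible choice; the point is the change of type from one string to a list of strings, and back again for output.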

Lists have the added power that you can change their elements. Strings are immutable: once created, a string cannot be changed. Lists, however, are mutable, and their contents can be modified at any time. As a result, lists support operations that modify the original value rather than producing a new value. Consolidate your knowledge of strings by trying some of the exercises on strings at the end of this chapter.
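The difference in mutability can be sketched as follows; the names and values are assumptions chosen for illustration.

```python
beatles = ['John', 'Paul', 'George', 'Ringo']
same_list = beatles              # a second name for the same list object

# Lists can be modified in place: the original object changes.
beatles[0] = 'John Lennon'
del beatles[-1]

# Strings cannot: item assignment raises a TypeError.
query = 'Who knows?'
try:
    query[0] = 'F'
    string_changed = True
except TypeError:
    string_changed = False

# "Changing" a string really means building a new one.
new_query = 'F' + query[1:]
```

Because `same_list` refers to the same object as `beatles`, it sees the in-place changes too, whereas `query` is untouched by the construction of `new_query`.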

The concept of "plain text" is a fiction. In this section, we will give an overview of how to use Unicode for processing texts that use non-ASCII character sets.

Unicode supports over a million characters. Each character is assigned a number, called a code point.
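In Python 3, the mapping between characters and code points can be inspected with the built-ins `ord()` and `chr()`. The sample character below is an assumption chosen for illustration.

```python
import sys
import unicodedata

char = '\u0144'                       # written as an escape to stay ASCII-safe

code_point = ord(char)                # the character's number
label = f'U+{code_point:04X}'         # conventional hexadecimal notation
name = unicodedata.name(char)         # official Unicode character name
round_trip = chr(code_point)          # back from number to character

# The highest code point Python can represent.
highest = sys.maxunicode
```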

Within a program, we can manipulate Unicode strings just like normal strings. However, when Unicode characters are stored in files or displayed on a terminal, they must be encoded as a stream of bytes. Some encodings such as ASCII and Latin-2 use a single byte per code point, so they can only support a small subset of Unicode, enough for a single language.

Other encodings such as UTF-8 use multiple bytes and can represent the full range of Unicode characters. Text in files will be in a particular encoding, so we need some mechanism for translating it into Unicode — translation into Unicode is called decoding.
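Moving between a Unicode string and its byte representation can be sketched with Python 3's `str.encode` and `bytes.decode`. The sample word is an assumption; it contains one character outside ASCII.

```python
word = 'Pa\u0144stwowa'                      # 9 characters, one non-ASCII

# A single-byte encoding uses exactly one byte per character.
latin2_bytes = word.encode('iso-8859-2')

# UTF-8 uses a variable number of bytes; U+0144 needs two.
utf8_bytes = word.encode('utf-8')

sizes = (len(word), len(latin2_bytes), len(utf8_bytes))

# Decoding turns bytes back into a Unicode string, provided we name
# the encoding that was used to produce them.
decoded = utf8_bytes.decode('utf-8')

# ASCII covers too small a subset of Unicode to represent this word.
try:
    word.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```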

Conversely, to write out Unicode to a file or a terminal, we first need to translate it into a suitable encoding. This translation out of Unicode is called encoding, and is illustrated in the figure "Unicode Decoding and Encoding".

From a Unicode perspective, characters are abstract entities which can be realized as one or more glyphs.

Only glyphs can appear on a screen or be printed on paper. A font is a mapping from characters to glyphs.

Extracting encoded text from files

Let's assume that we have a small text file, and that we know how it is encoded.
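As a minimal sketch of this situation, the following writes a small file in a known encoding and reads it back. It assumes Python 3's built-in `open()` with its `encoding` argument; the file name, the sample text, and the choice of ISO-8859-2 are all assumptions made for illustration.

```python
import os
import tempfile

sample = 'Pa\u0144stwowa\n'
# Hypothetical file name in a throwaway directory.
path = os.path.join(tempfile.mkdtemp(), 'sample-latin2.txt')

# Write the text out, encoding it as ISO-8859-2 bytes.
with open(path, 'w', encoding='iso-8859-2') as f:
    f.write(sample)

# Read it back, naming the encoding so the bytes are decoded correctly.
with open(path, encoding='iso-8859-2') as f:
    lines = [line.strip() for line in f]

# Opening in binary mode shows the raw, still-encoded bytes.
with open(path, 'rb') as f:
    raw_bytes = f.read()
```

Reading the same file without naming its encoding would fall back on a platform default, which may or may not match what was written.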

This file is encoded as Latin-2, also known as ISO-8859-2. The function that opens the file takes a parameter to specify the encoding of the file being read or written.

3 Processing Raw Text

The most important source of texts is undoubtedly the Web. It's convenient to have existing text collections to explore, such as the corpora we saw in the previous chapters.



