# Encapsulation – From Bits to Gigabytes in 200 Years

Encapsulation comes in many forms and, in my opinion, is often overlooked and under-reported as a major concept in many IT, telecoms, computer, networking or other binary-based technical courses.

We don't even teach visualising "encapsulation" with the perfect explanatory image of a Russian Doll, yet most children are familiar with that.

For me, encapsulation is about the most important concept in computing, as our language is also based on it and follows rules or protocols – letters make words, words make sentences, sentences make paragraphs, to pages, chapters, books, libraries and so on; our mathematical systems are described by it; decimal units are grouped in tens, tens in hundreds etc.

Encapsulation of data organised as useful information.

Exactly the concept we used to create computer systems and be able to program them to be useful tools in the first place.

If you don't understand the difference between data and information, read my 2nd year degree level Databases research paper here:

The basic principle of encapsulation, where smaller units of something get grouped into fixed-size, larger and larger containers for ease of transport and administration, is nothing new. So why are this important concept, and the protocols that define those methods, left to the side so much, if broached at all, in many beginner-level tech courses?

A brief explanation of data processing (a data stream or data storage in some form, no matter how long or short the time period taken), and of how that data is deciphered according to the relevant rules or protocols, should be given at the start of any course, because it's everywhere in digital computing in some form at every moment, whether in a word processor or in an Internet subsea backbone optical fibre!

Without encapsulation according to a protocol, everything is just a seemingly random collection of 1s and 0s, either streamed or held in a static magnetic or electrical field until read.

So how are these bits structured and re-arranged to have meaning in human reality? The Database paper outlines a database specific set of methods, but there are many other ways – as you would expect if it's "everywhere" – and nowadays, it pretty much is!

I'd define the Internet that way – everywhere – if you have the right gear, anywhere on the planet, you can connect to a satellite at least.

What are some ways it can be shown more specifically in the various systems that make up a computer environment? In most systems, bits are most commonly stored either on a hard drive or memory chips. First, an example of hard drive encapsulation.

In a directory or file listing using ls -l of various formats, there's a lesser-known column, before the modification date, that states numerical info on those files – but what is it?

You could Google it and get an answer immediately, but let's see if a Linux command jigsaw puzzle approach can work with research, to learn some new commands in the process, by seeing if man can supply info on relevant terms.

First, create an empty file:

Mint5630 SEDAWK # touch testfile.txt

Mint5630 SEDAWK # ls -l testfile.txt

-rw-r--r-- 1 stevee stevee 0 Oct 3 16:28 testfile.txt

First, briefly, the other better known fields in that output are:

file type (-); read,write,exe (rwx) permissions (x3) for user,group,others; file links (1); user (owner, stevee); group (membership, stevee); ??(0); modification time (date); filename

So what does this number 0 before the modification time/date stamp mean? Is it the same as the "allocated size in blocks" stated in man, given that we know it's an empty file?

man ls

-s, --size

print the allocated size of each file, in blocks

stevee@Mint5630 ~ \$ ls -ls testfile.txt

0 -rw-r--r-- 1 stevee stevee 0 Oct 3 16:28 testfile.txt

No, because this gives a new field (0) at the start, though the number is the same; so has it just been repeated or is it something else? Unix wouldn't be pointlessly repetitive in that way.

If I append some data to the empty file, say, the A character, without a new line character, will anything change?:

stevee@Mint5630 ~ \$ echo -n A >> testfile.txt

stevee@Mint5630 ~ \$ ls -ls testfile.txt

4 -rw-r--r-- 1 stevee stevee 1 Oct 3 16:43 testfile.txt

Ah, now there is an indication of something else – but what?

The 0 changes to a 1, of whatever it is – bit?; byte?; block? – and the first 0 changes to a 4.

stevee@Mint5630 ~ \$ ls -lsh testfile.txt

4.0K -rw-r--r-- 1 stevee stevee 1 Oct 3 16:43 testfile.txt

Seems the 1 does not measure a standard data bit size multiple in terms of bits or bytes etc., but the 4K does.

If the smallest unit of data is a 1 or 0, or a "bit", what are the various ways they can accumulate and be named as databit multiples? It turns out, depending on what they are being used for to describe useful information – say characters, numbers, packets etc. – they can accumulate in many ways with many different names, for both historic and current technological reasons.

Better find out what each of the most fundamental bits are first, then what fundamental groupings they can become regardless of what medium they are stored on or streamed across.

"The trend in hardware design converged on the most common implementation of using eight bits per byte, as it is widely used today. However, because of the ambiguity of relying on the underlying hardware design, the unit octet was defined to explicitly denote a sequence of eight bits."

So, small k + small b = kilobit = 1000 bits

Large B = byte = octet = 8 bits

"Most UNIX filesystem types have a similar general structure, although the exact details vary quite a bit. The central concepts are superblock, inode, data block, directory block, and indirection block. The superblock contains information about the filesystem as a whole, such as its size (the exact information here depends on the filesystem). An inode contains all information about a file, except its name. The name is stored in the directory, together with the number of the inode. A directory entry consists of a filename and the number of the inode which represents the file. The inode contains the numbers of several data blocks, which are used to store the data in the file. There is space only for a few data block numbers in the inode, however, and if more are needed, more space for pointers to the data blocks is allocated dynamically. These dynamically allocated blocks are indirect blocks; the name indicates that in order to find the data block, one has to find its number in the indirect block first…

5.10.5. Filesystem block size

The block size specifies size that the filesystem will use to read and write data. Larger block sizes will help improve disk I/O performance when using large files, such as databases. This happens because the disk can read or write data for a longer period of time before having to search for the next block. On the downside, if you are going to have a lot of smaller files on that filesystem, like the /etc, there the potential for a lot of wasted disk space. For example, if you set your block size to 4096, or 4K, and you create a file that is 256 bytes in size, it will still consume 4K of space on your harddrive. For one file that may seem trivial, but when your filesystem contains hundreds or thousands of files, this can add up."

So, it seems my testfile.txt may be made up of one or more data blocks, each of a size set by the file system type in current use, and the number of those blocks used for the file indicates the total size allocated on disk.

How do you find out what the default block size of your FS is? First, you have to know your FS type:

stevee@Mint5630 ~ \$ sudo blkid

/dev/sda1: UUID="436a55a9-f610-45f6-866d-a72bfe10ff74" TYPE="ext4"

man ext4

DESCRIPTION

The second, third, and fourth extended file systems, or ext2, ext3, and ext4 as they are commonly known, are Linux file systems that have historically been the default file system for many Linux distributions.

This ext4 feature allows the mapping of logical block numbers for a particular inode to physical blocks on the storage device to be stored using an extent tree, which is a more efficient data structure than the traditional indirect block scheme used by the ext2 and ext3 file systems. The use of the [...] improves file system performance, and decreases the [...] needed to run e2fsck(8) on the file system.

I can find out the sector size and block size of my hard drive:

stevee@Mint5630 ~ \$ sudo blockdev --report /dev/sda1

RO RA SSZ BSZ StartSec Size Device

rw 256 512 4096 2048 37860933632 /dev/sda1

Seems the actual physical disk sector size is 512 bytes long, with 8 of those creating an ext4 block size of 4096 bytes.

So now, it seems I found enough info to assume that my tiny, 1 character content testfile.txt occupies 1 ext4 file system block (the minimum possible amount to store actual data, rather than mark it as 0 blocks, or empty) of data, of 4096 bytes or 4K space size that has been allocated for it, even if it does not fill it.

Indeed, this is what the ls -lsh output showed above:

stevee@Mint5630 ~ \$ ls -lsh testfile.txt

4.0K -rw-r--r-- 1 stevee stevee 1 Oct 3 16:43 testfile.txt

Initially, the sector allocation size does not seem to fit with what the default output of stat says:

stevee@Mint5630 ~ \$ sudo stat testfile.txt

File: 'testfile.txt'

Size: 1     Blocks: 8   IO Block: 4096 regular file

Device: 801h/2049d    Inode: 535929 Links: 1

Access: (0644/-rw-r--r--) Uid: ( 1000/ stevee) Gid: ( 1000/ stevee)

Access: 2015-10-02 23:15:01.695360490 +0100

Modify: 2015-10-02 23:14:57.163233200 +0100

Change: 2015-10-02 23:14:57.163233200 +0100

Birth: -

This implies 8 blocks of 4096 bytes each, or 2^15 or 32768 bytes, or 32K of space allocated – which there obviously is not – so what is wrong here?

This comes down to stat's definition of a "block", which differs from the ext4 "block" shown earlier. stat counts in units of 512 bytes, the same as the base hard drive sector size here, whereas the ext4 file system is organised, as described above, in file system blocks of 4096 bytes or 4K. This then makes sense, as 8 blocks of 512 bytes each make 1 file system IO block of 4096 bytes.

Be aware of the base sector size of a hard drive, organised (as multiples) to form a particular file system IO block, of given size. One FS block encapsulates 8 drive sectors, to form a larger storage/transport unit!

We have seen this sector to file system variation already in the dd command posts, where you can change the IO block size to suit your hardware's optimal performance and speed up copying time:

Due to shell aliases and built-in 'stat' functions, using an unadorned 'stat' interactively or in a script may get you different functionality than that described here. Invoke it via 'env' (i.e., 'env stat …') to avoid interference from the shell.

'-c'

'--format=FORMAT'

Use FORMAT rather than the default format. FORMAT is automatically newline-terminated, so running a command like the following with two or more FILE operands produces a line of output for each operand:

\$ stat --format=%d:%i / /usr

2050:2

2057:2

The valid FORMAT directives for files with '--format' and '--printf' are:

* %b – Number of blocks allocated (see '%B')
* %B – The size in bytes of each block reported by '%b'

So now, if you use the --format option with %b or %B, things become clearer again, as the sector block count and size are stated by the %b and %B formats:

Mint5630 stevee # stat --format %b testfile.txt

8

Mint5630 stevee # stat --format %B testfile.txt

512

This also gives the same values of 8 sector blocks x 512 bytes = 4096 bytes or 4K as ls -lsh does for this file.
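The same three figures can also be read programmatically. This is only a minimal Python sketch, assuming a Unix-like system where os.stat() exposes st_blocks (counted in 512-byte units, like %b and %B above) and st_blksize; the helper name allocation_report and the temporary file are made up for illustration:

```python
import os
import tempfile

def allocation_report(path):
    """Return (content bytes, allocated bytes, preferred IO block size)."""
    st = os.stat(path)
    # st_blocks counts 512-byte units, not file system blocks
    return st.st_size, st.st_blocks * 512, st.st_blksize

# Write a single 'A' with no newline, like: echo -n A >> testfile.txt
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"A")
    path = f.name

size, allocated, io_block = allocation_report(path)
print(f"content: {size} byte(s), allocated: {allocated} bytes, IO block: {io_block}")
os.remove(path)
```

The content size should always be 1 here, but the allocated figure depends on the file system holding the temporary directory, so it may not match the 4096 bytes seen on ext4.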

So far, all that explains the disk and file system storage allocated to store the file, but what about the file contents itself?

Interestingly at this point, the nemo-terminal shows two different values for testfile.txt size – the command line shows IO block allocation, and the GUI shows the file contents size – 1 character per byte? – handy if so!

This allows a theory check for whether each character contained in a file IS a byte in size – let's append some more to it:

stevee@AMDA8 ~/Documents \$ echo -n BCDEFGH >> testfile.txt

stevee@AMDA8 ~/Documents \$ ls -alsh testfile.txt

4.0K -rw-r--r-- 1 stevee stevee 8 Oct 6 18:57 testfile.txt

Sure enough above, the IO block size of 4k remains unfilled, but the GUI shows 8 bytes after the addition of another 7 characters. Seems each ASCII character is a byte in size!

Not convinced?

rm -v testfile.txt
removed 'testfile.txt'
for x in {1..4096}; do echo -n A >> testfile.txt; done
ls -als testfile.txt
4 -rw-r--r-- 1 stevee stevee 4096 Aug 8 11:29 testfile.txt

With man for reference, the programs od and hexdump can output characters as octal/hex/decimal bytes, and much more, if you want to see what octet numbers relate to ASCII characters etc. From man hexdump:

-b One-byte octal display. Display the input offset in hexadecimal, followed by sixteen space-separated, three column, zero-filled, bytes of input data, in octal, per line.

-c One-byte character display. Display the input offset in hexadecimal, followed by sixteen space-separated, three column, space-filled, characters of input data per line.

-C Canonical hex+ASCII display. Display the input offset in hexadecimal, followed by sixteen space-separated, two column, hexadecimal bytes, followed by the same sixteen bytes in %_p format enclosed in "|" characters. Calling the command hd implies this option.

-d Two-byte decimal display. Display the input offset in hexadecimal, followed by eight space-separated, five column, zero-filled, two-byte units of input data, in unsigned decimal, per line.

So now I know that a single ASCII non-control character in this file is only 1 byte in size:

stevee@Mint5630 ~ \$ hexdump -bcCd testfile.txt

You can see that "A" has the same numerical value in each base, if you do the maths: 41 in hex (4×16 + 1 = 65); 65 in base 10; and 101 in octal (1×64 + 0×8 + 1 = 65).

stevee@AMDA8 ~/Desktop \$ od -b testfile.txt
0000000 101 102 103 104 105 106 107 110
0000010
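The base conversions can be checked with a short Python sketch. The function name describe is purely illustrative; it uses only the built-in ord() and format() to show one character code in hex, decimal and octal, then rebuilds the data columns of the od -b line for the 8 characters ABCDEFGH:

```python
def describe(ch):
    """Return the hex, decimal and octal forms of one character's code."""
    code = ord(ch)
    return format(code, "x"), code, format(code, "o")

print(describe("A"))  # ('41', 65, '101')

# Positional arithmetic behind those digits
assert 4 * 16 + 1 == 65          # hex 41
assert 1 * 64 + 0 * 8 + 1 == 65  # octal 101

# Rebuild the data columns of the od -b line for ABCDEFGH
data = "ABCDEFGH".encode("ascii")
octal_line = " ".join(format(b, "03o") for b in data)
print(len(data))   # 8 bytes for 8 ASCII characters
print(octal_line)  # 101 102 103 104 105 106 107 110
```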

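At first glance, the hd -b output for the original file with just the A in it looks like 3 bytes: a Null, the A itself, and a Start Of Heading control character, 0000001.

stevee@AMDA8 ~/Desktop \$ hd -b testfile.txt
00000000 41 |A|
0000000 101
0000001

That reading is wrong, though. As the man hexdump excerpt says, each display starts with the input offset, which is not file content. 00000000 and 0000000 are both the offset of the first byte (shown once for the -C display that hd implies, and once for the -b display), and the final 0000001 is the offset reached at the end of input, which is the file length of 1 byte. The same applies to the closing 0000010 in the od output earlier: octal 10 is 8, the length of the 8 character file. So the file holds only the A, which is why the nemo window size column shows 1 byte.

If a Null (all 0s) or a Start Of Heading (value 1) really were stored in the file, each would occupy a full byte of 8 bits like any other character, and would be counted in the file size even though neither is printable.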

## ASCII control characters (character code 0-31)

The first 32 characters in the ASCII-table are unprintable control codes and are used to control peripherals such as printers.

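| DEC | OCT | HEX | BIN | Symbol | HTML Number | HTML Name | Description |
|---|---|---|---|---|---|---|---|
| 0 | 000 | 00 | 00000000 | NUL | `&#000;` | | Null char |
| 1 | 001 | 01 | 00000001 | SOH | `&#001;` | | Start of Heading |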

What system defines alphabetic or other "characters" that are used in an allocated file "space"?

To answer that adequately, a history lesson is required. With historical hindsight, many other aspects of today's IT systems, such as security, privacy, efficiency, statistical analysis and overhead, were also present in more primitive forms of electrical telecoms, even if the implications were not fully realised or intended at inception.

Historically there have been many coded transmission systems, and their origins lie in older forms of "networking" as we know it now, or "signals transmission" of one sort or another, that were around long before electricity was discovered – smoke signals, drums, bonfire beacons, flag semaphore etc.

These are self explanatory for the most part, and may just have signified an "event" of some sort, but the important thing to note is that the event or non-event represents a form of binary signaling – if x happens, light fire!; if not, don't!

Jumping to electrical signals, Morse Code superseded the visual (and public) forms of long distance semaphore by transmitting pulses in a circuit, which implies a basic level of security, as only the sender and receiver know the message.

This system comprised a series of time dependent pulses – dots and dashes – where the duration of a dash equalled 3 dots, and the time between each pulse also equalled a dot duration.

If viewed as an encapsulation method, Morse is interesting as it combines elements of a minimum "element" size (a dot or space) and a maximum "elemental unit combination" size of 5 dashes separated by 4 dots for the longest character, "0":

If this was seen as time pulses, then character 0 would comprise (5×3 dots) + (4 dots) = 19 dot lengths.

All the number and letter characters could be contained within a larger "encapsulated unit" of 19 pulses time duration – or any other term you may like to call them – maybe envelope? Parcel? Packet? Datagram? Frame? Container?

This would be "structurally inefficient" but "practically efficient" in today's technological terms, as the actual "Morse code design" had already been optimised for transmission speed (efficiency) by humans, using statistical analysis – the reason the "E" is the shortest duration of 1 dot length – because it is the most common letter in most English language words, so the quickest to transmit by hand.

This "statistically optimised" design would be compromised if all the characters were "encapsulated" in a framework of a fixed length of 19 pulses. This would of course be irrelevant if transmitted automatically, using a 19 dot length "packet", by a computer system that could transmit a dot length more than about 20 times faster than a human.

Studying systems like this, it can be seen that there has to be a minimum 3 dot length separation between characters, to avoid confusion between characters that end with a stream of dots and those that start with dots. The same logical separation exists for words, with a gap of 7 dot lengths, which is the same duration as a letter D, H or M, but made of silence.
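The dot-length arithmetic can be checked with a small Python sketch. It assumes only the timing rules given in the text (a dash is 3 dots, the gap between elements inside a character is 1 dot); the MORSE dictionary is a hand-picked subset and the names are illustrative:

```python
# International Morse for the characters discussed (subset only)
MORSE = {"E": ".", "D": "-..", "H": "....", "M": "--", "0": "-----"}

DOT, DASH, ELEMENT_GAP = 1, 3, 1  # durations in dot lengths
LETTER_GAP, WORD_GAP = 3, 7       # silences between letters and words

def duration(char):
    """Length of one character in dot units, including gaps between its elements."""
    code = MORSE[char]
    marks = sum(DOT if element == "." else DASH for element in code)
    return marks + ELEMENT_GAP * (len(code) - 1)

for char in MORSE:
    print(char, duration(char))
# E 1, D 7, H 7, M 7, 0 19
```

D, H and M each come out at 7 dot lengths, the same as the word gap, and 0 at 19, the longest character.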

If the minimum dot length in Morse is seen as an equivalent of a 1 or 0 in today's binary systems, it can be seen that binary is more efficient in describing the characters if each is "encapsulated" within a byte length, because the 26 letters of the English alphabet could be ascribed numbers 1-26 in binary, which takes only 5 binary bits to range from 1-26, which is [00001 – 11010] in binary. The maximum character dot length in Morse is 19 "time pulses" long, seen above. One octet of binary only requires 8, and depending on how you defined a "byte", then 5 bits would suffice to cater for all 26 alphabet letters A-Z.

Even if the numbers 0-9 are included, this gives 36 characters, that can be described in binary with only 1 more digit, the 2^5, or 32 column, for 6 bit length characters.
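As a quick check on those bit counts, here is a minimal Python sketch; bits_needed is a hypothetical helper name, and it simply asks how many bits give at least that many distinct patterns:

```python
def bits_needed(symbols):
    """Smallest number of bits giving at least `symbols` distinct patterns."""
    return max(1, (symbols - 1).bit_length())

print(bits_needed(26))  # 5 -> 2**5 = 32 patterns, enough for A-Z
print(bits_needed(36))  # 6 -> 2**6 = 64 patterns, enough for A-Z plus 0-9
print(format(1, "05b"), format(26, "05b"))  # 00001 11010
```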

One of the earliest systems to use a 5 bit "byte" was the Baudot Code, which was used for telegraphic transmission but is no longer in use today:

"Baudot's code became known as International Telegraph Alphabet No. 1, and is no longer used."

Baudot code developed from Bacon's Cipher:

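which can be seen as an equivalent of binary bytes, if char A represents binary 0, and char B represents binary 1. As u=v and i=j, all letters are described by 24 codes of 5 character "digits" each, as below, from 00000 (a) to 10111 (z):

| Letter | Code | Letter | Code | Letter | Code | Letter | Code |
|---|---|---|---|---|---|---|---|
| a | AAAAA | g | AABBA | n | ABBAA | t | BAABA |
| b | AAAAB | h | AABBB | o | ABBAB | u-v | BAABB |
| c | AAABA | i-j | ABAAA | p | ABBBA | w | BABAA |
| d | AAABB | k | ABAAB | q | ABBBB | x | BABAB |
| e | AABAA | l | ABABA | r | BAAAA | y | BABBA |
| f | AABAB | m | ABABB | s | BAAAB | z | BABBB |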

The encapsulation is in the form of 5 elements per character block, and can be translated to a present day equivalent of a time dependent bit stream of square wave clock pulses, where each pulse represents a possible 0 or 1 bit value:

where a peak represents a B, or 1, and a trough is an A or 0.
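The mapping from Bacon's A/B groups to 5 bit binary can be sketched in Python. This is an illustration only: it assumes the 24 letter alphabet described above (i and j share a code, as do u and v), and the names LETTERS, CODES and bacon are made up for the example:

```python
import string

# 24-letter alphabet: i/j share a code, as do u/v
LETTERS = [c for c in string.ascii_lowercase if c not in "jv"]
CODES = {c: format(n, "05b") for n, c in enumerate(LETTERS)}
CODES["j"], CODES["v"] = CODES["i"], CODES["u"]

def bacon(text, as_bits=False):
    """Encode letters as 5-element groups of A/B (or 0/1 if as_bits)."""
    groups = [CODES[c] for c in text.lower() if c in CODES]
    if not as_bits:
        groups = [g.replace("0", "A").replace("1", "B") for g in groups]
    return " ".join(groups)

print(bacon("az"))                # AAAAA BABBB
print(bacon("az", as_bits=True))  # 00000 10111
print(bacon("uv"))                # BAABB BAABB
```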

Each character block is then delimited and so defined by an external reference clock source, that relentlessly beats time at exact intervals between each character's only allowed existence period of 5 pulses, at an amplitude of 0 or 1 only per pulse, with a counter system of some sort marking each contiguous block of 5 time periods. In a similar fashion to the Morse letter and word separator, there would need to be a period and/or pattern of pulses that represented a gap between character types, whose binary value was not equal to any of the letter or number characters.

The rule set defining all these various parameters that may be required in a functional transmission system – bits per byte, clock pulse source and pulse counter, pulse duration in seconds, delimiter byte value, start of message byte value or pattern, end of message byte value or pattern, error detection methods, re-transmission of lost bytes, methods of synchronisation at each end, limits of technology – and overall method of operation is called a "protocol".

It is only when spelled out in those terms that you realise how much is present but taken for granted in the control system of Morse Code, when humans are doing it, naturally using our own language "protocols" to determine aspects like "message start and end", "garbled message", negotiated speed of transmission and etiquette, that you think how complicated it all is to translate for a machine to do it.

That form of counting regularly spaced clock pulses is a "time dependent" coding, and all bits are organised so they fall in step – either singularly or in defined blocks – with a Master Clock of some form.

Using a very fast electronic clock source enabled a particularly efficient method of sending many signals seemingly "simultaneously", called Time Division Multiplexing – TDM.

This is a method of combining multiple "conversations" or messages along the same media (copper wire, optical fibre, wifi etc.) by interspersing small "chunks" of each message in a small time space, so sharing the media and clock source "time space", using it to transmit – seemingly instantaneously – all the separate channels' messages "at once" if done quickly enough, compared to a human's rate of perception.

The graphic example below could be seen as a way of constructing and deconstructing the 5 binary digit "bytes" of the Baudot Code above, where each consecutive bit of each Baudot character byte is from a different message stream from 5 different sources.

This shows that it would take 5 clock pulses to send the first bit of each of the 5 separate messages, before cycling round to send the 2nd bit, then the 3rd and so on. It would take 21 pulses for the first 5 bit byte to be re-assembled by the receiver to decipher the first Baudot letter/number character for conversation A, and 25 pulses before all conversations had received the first character of their respective message.
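The pulse counts can be reproduced with a small Python sketch of round-robin bit interleaving. The five streams and their bit patterns are hypothetical, as are the function names; the only assumption taken from the text is that one bit from each source is sent per clock pulse, in source order:

```python
def tdm_schedule(streams):
    """Return [(pulse_number, source, bit)] for round-robin bit interleaving."""
    names = list(streams)
    length = len(streams[names[0]])
    schedule = []
    pulse = 0
    for i in range(length):      # the i-th bit of every stream...
        for name in names:       # ...sent in source order A, B, C, ...
            pulse += 1
            schedule.append((pulse, name, streams[name][i]))
    return schedule

def completed_at(schedule, source):
    """Pulse on which the last bit of `source` arrives."""
    return max(pulse for pulse, name, _ in schedule if name == source)

def demux(schedule, source):
    """Pick one source's bits back out of the shared stream."""
    return "".join(bit for _, name, bit in schedule if name == source)

# Five hypothetical sources, each sending one 5-bit character
streams = {"A": "00000", "B": "00001", "C": "00010", "D": "00011", "E": "00100"}
sched = tdm_schedule(streams)
print(completed_at(sched, "A"))  # 21
print(completed_at(sched, "E"))  # 25
print(demux(sched, "C"))         # 00010
```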

This is a particularly "democratic" method of sharing available hardware and technology, so all receive their messages in a more "fair" and timely fashion without "E" having to wait, last in the queue, until all the others had sent theirs first.

More importantly, it means there does not have to be the cost and engineering involved in creating 5 separate links, 5 encoders, 5 decoders and 5 lots of other associated resources to be built and installed to cover the same geographic area. A MASSIVE construction cost advantage, the larger the scale and distances involved.

Think current, state of the art, multi-billion dollar, but single cable, optical subsea systems that link continents to appreciate the real benefits of multiplexing – and so – encapsulation!

Multiplexing and demultiplexing have been the cornerstone of digital telegraphy/telephony (the Internet) since it became publicly available around the 1980s, with the first digital network and precursor to the modern TCP/IP Internet – ARPANET – sending its first message in 1969.

This network was designed around the concept of "packet switching", where packets of data – predefined collections of encapsulated bytes that represent a message of some type – are transmitted separately across the telephone network, but can take different phone circuit routes as available and required, before being re-assembled in the correct order that they were sent, to re-create the original message.
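The idea of splitting a message into numbered packets and re-assembling them in order can be sketched in a few lines of Python. This is a toy model, not how any real network stack does it: the packet here is just a (sequence number, chunk) pair, the shuffle stands in for packets arriving by different routes, and the function names are invented for the example:

```python
import random

def packetise(message, size):
    """Split a message into (sequence_number, chunk) packets."""
    return [(seq, message[i:i + size])
            for seq, i in enumerate(range(0, len(message), size))]

def reassemble(packets):
    """Sort by sequence number and join the payloads."""
    return b"".join(chunk for _, chunk in sorted(packets))

message = b"hello, packet switching"
packets = packetise(message, 4)
random.shuffle(packets)  # stand-in for packets taking different routes
print(reassemble(packets) == message)  # True
```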

Initially, the bulk of the data sent over the existing analogue audio frequency telephone network was digitised telephone conversations, using the Integrated Services Digital Network (ISDN) protocol. The older Cisco CCNA course covered ISDN in some depth, before it became more or less obsolete and was replaced by DSL of various types, and it describes examples of ISDN encapsulation design:

"There are two ISDN services: BRI and PRI. The ISDN BRI service offers two 8-bit B channels and one 2-bit D channel, often referred to as 2B+D, as shown in the figure. ISDN BRI delivers a total bandwidth of a 144-kbps line into three separate channels (8000 frames per second * (2*8-bit B channel+2 bit D channel)=8000*18 = 144kbps). BRI B channel service operates at 64 kbps (8000 frames per second* 8-bit B channel) and is meant to carry user data and voice traffic.

ISDN provides great flexibility to the network designer because of its ability to use each of the B channels for separate voice and/or data applications. For example, a long document can be downloaded from the corporate network over one ISDN 64-kbps B channel while the other B channel is being used to connect to browse a Web page.

The third channel, the D channel, is a 16-kbps (8000 frames per second * 2 bit D channel) signaling channel used to carry instructions that tell the telephone network how to handle each of the B channels. BRI D channel service operates at 16 kbps and is meant to carry control and signaling information, although it can support user data transmission under certain circumstances. The D channel signaling protocol occurs at Layers 1 through 3 of the OSI reference model.

ISDN physical-layer (Layer 1) frame formats differ depending on whether the frame is outbound (from terminal to network-the TE frame format) or inbound (from network to terminal-the NT frame format). Both of the frames are 48 bits long, of which 36 bits represent data. Actually, the frames are two 24 bit frames in succession consisting of 2 8-bit B channels, a 2-bit D channel, and 6 bits of framing information (2*(2*8B+2D+6F) = 32B+4D+12F = 36BD+12F = 48BDF). Both physical-layer frame formats are shown in the figure. The bits of an ISDN physical-layer frame are used as follows:

• Framing bit – Provides synchronization.
• Echo of previous D channel bits-Used for contention resolution when several terminals on a passive bus contend for a channel.
• Activation bit – Activates devices.
• Spare bit – Unassigned.
• B1 channel bits.
• B2 channel bits.
• 8 Added channel bit counts bits.
• D channel bits – Used for user data.

Note that each of the ISDN BRI frames are sent at a rate of 8000 per second. There are 24 bits in each frame (2*8B+2D+6F = 24) for a bit rate of 8000*24 = 192Kbps. The effective rate is 8000*(2*8B+2D) = 8000*18 = 144Kbps.

Multiple ISDN user devices can be physically attached to one circuit. In this configuration, collisions can result if two terminals transmit simultaneously. ISDN therefore provides features to determine link contention. These features are part of the ISDN D channel, which is described in more detail later in this chapter."

You can see now how the complexity of principles builds, with both encapsulation and multiplexing combining to create an architecture of bits per second, encapsulated into 24 bit frames carrying 3 multiplexed channels, at a rate of 8000 frames per second, giving a clock speed of 8000 x 24 = 192 kbps.

The ISDN system's protocol clock speed units became the base multiples for the data "containers" of the present day multiplexed optical fibre "transport protocols", SONET and SDH. These have grown in size with each increase in electronic technology speed, up to today's Dense Wavelength Division Multiplexing (DWDM) optical fibre, with top level international trunk links being built up from the 2 slightly different base fibre transmission system rates below.

Each system adds its own overhead for management functions when multiplexed at higher levels, so upward multiplication from ISDN's slower clock speed multiplexing is not immediately apparent, as very many channels can now be multiplexed together according to protocol, along with "interleaved" management data structures, making a more complex multiplexed transmission system at very high bit rates:
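The BRI figures quoted above can be reproduced from the frame layout alone. A minimal Python sketch, using only the numbers given in the quoted course text (the constant names are illustrative):

```python
FRAMES_PER_SECOND = 8000
B_BITS, D_BITS, FRAMING_BITS = 8, 2, 6    # per 24-bit frame; there are two B channels

payload_bits = 2 * B_BITS + D_BITS        # 18 bits of 2B+D per frame
frame_bits = payload_bits + FRAMING_BITS  # 24 bits per frame

print(FRAMES_PER_SECOND * B_BITS)        # 64000  -> one B channel, 64 kbps
print(FRAMES_PER_SECOND * D_BITS)        # 16000  -> D channel, 16 kbps
print(FRAMES_PER_SECOND * payload_bits)  # 144000 -> effective rate, 144 kbps
print(FRAMES_PER_SECOND * frame_bits)    # 192000 -> line rate, 192 kbps
```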

"ISDN PRI service offers 23 8-bit channels and 1 8-bit D channel plus 1 framing bit in North America and Japan, yielding a total bit rate of 1.544 Mbps (8000 frames per second * (23 * 8-bit B channels + 8-bit D channel + 1 bit framing) = 8000*8*24.125 = 1.544 Mbps) (the PRI D channel runs at 64 kbps). ISDN PRI in Europe, Australia, and other parts of the world provides 30 8-bit B channels plus one 8-bit D channel plus one 8-bit Framing channel, for a total interface rate of 2.048 Mbps (8000 frames per second* (30*8-bit B channels + 8-bit D channel + 8-bit Framing channel = 8000*8*32 =2.048 Mbps)."
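The two PRI rates follow the same pattern of bits per frame multiplied by frames per second. A short Python sketch using the figures from the quote (pri_rate is a hypothetical helper name):

```python
FRAMES_PER_SECOND = 8000

def pri_rate(b_channels, d_bits, framing_bits, channel_bits=8):
    """Bits per second for one PRI frame layout."""
    bits_per_frame = b_channels * channel_bits + d_bits + framing_bits
    return FRAMES_PER_SECOND * bits_per_frame

print(pri_rate(23, 8, 1))  # 1544000 (North America and Japan)
print(pri_rate(30, 8, 8))  # 2048000 (Europe, Australia and elsewhere)
```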

So you can see how the relatively simple concept of on/off or 1/0 or yes/no logic can build into more complex yet ordered structures enabling the multiplexing, transmission, reception and "demuxing" of massive amounts of data that now flow at superfast bit rates around the globe in seconds, every day, ceaselessly, as there is (almost!) always an alternative path available somewhere should a physical link be unavailable.

Below is an Optical Spectrum Analyser view of only 8 test laser carrier channels of the possible hundreds that can now be sent down a single optical fibre, each carrying data pulses at gigabits per second rates in total.

(Taken when working on the Greenland Connect project connecting Clarenville, Newfoundland to Nuuk, Greenland in 2008.) Common timing sources between inter-continental cable stations are atomic clocks and/or GPS signals.

The word protocol has been used a lot, so what does my Mint system know about network related protocols?

getent protocols | wc -l
55

getent protocols

ip 0 IP
hopopt 0 HOPOPT
icmp 1 ICMP
igmp 2 IGMP
ggp 3 GGP
ipencap 4 IP-ENCAP
st 5 ST
tcp 6 TCP
egp 8 EGP
igp 9 IGP
pup 12 PUP
udp 17 UDP
hmp 20 HMP
xns-idp 22 XNS-IDP
rdp 27 RDP
iso-tp4 29 ISO-TP4
dccp 33 DCCP
xtp 36 XTP
ddp 37 DDP
idpr-cmtp 38 IDPR-CMTP
ipv6 41 IPv6
ipv6-route 43 IPv6-Route
ipv6-frag 44 IPv6-Frag
idrp 45 IDRP
rsvp 46 RSVP
gre 47 GRE
esp 50 IPSEC-ESP
ah 51 IPSEC-AH
skip 57 SKIP
ipv6-icmp 58 IPv6-ICMP
ipv6-nonxt 59 IPv6-NoNxt
ipv6-opts 60 IPv6-Opts
rspf 73 RSPF CPHB
vmtp 81 VMTP
eigrp 88 EIGRP
ospf 89 OSPFIGP
ax.25 93 AX.25
ipip 94 IPIP
etherip 97 ETHERIP
encap 98 ENCAP
pim 103 PIM
ipcomp 108 IPCOMP
vrrp 112 VRRP
l2tp 115 L2TP
isis 124 ISIS
sctp 132 SCTP
fc 133 FC
udplite 136 UDPLite
mpls-in-ip 137 MPLS-in-IP
manet 138
hip 139 HIP
shim6 140 Shim6
wesp 141 WESP
rohc 142 ROHC

I recognise some from the Cisco CCNA course, IPSEC, EIGRP, L2TP, GRE, MPLS – mainly backbone router links related.

Returning to the concept of text and numbers being the essential prerequisite for encoding and so sending messages, how do computers manage standards for these and other control characters today? To aid understanding, a good example that you can run through by pasting commands, for one main International Standard, is in William Shotts' excellent Linux guide on p274 of TLCL.PDF. You see that ASCII initially used a 7 bit code – values 0-127, or 2^7 = 128 codes in all, with 2^6 as the highest column (as the lowest column is 2^0 = 1).
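The same protocol numbers are visible from a program. A small Python sketch using the standard socket module: the IPPROTO_* constants are compiled in, while socket.getprotobyname() looks a name up in the system's protocols database (the source getent reads) and raises OSError if there is no entry, hence the guard:

```python
import socket

# Protocol numbers as compiled into Python's socket module
numbers = {name.lower(): getattr(socket, f"IPPROTO_{name}")
           for name in ("ICMP", "TCP", "UDP")}
print(numbers)

# Lookup by name via the system protocols database, if one is present
try:
    print(socket.getprotobyname("tcp"))
except OSError:
    print("no protocols database entry for tcp")
```

The constants should line up with the getent listing above: icmp 1, tcp 6 and udp 17.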

"POSIX Character Classes
The traditional character ranges are an easily understood and effective way to handle the
problem of quickly specifying sets of characters. Unfortunately, they don't always work.
While we have not encountered any problems with our use of grep so far, we might run
into problems using other programs.
Back in Chapter 4, we looked at how wildcards are used to perform pathname expansion.
In that discussion, we said that character ranges could be used in a manner almost identical
to the way they are used in regular expressions, but here's the problem:

[me@linuxbox ~]\$ ls /usr/sbin/[ABCDEFGHIJKLMNOPQRSTUVWXYZ]*

/usr/sbin/MAKEFLOPPIES
/usr/sbin/NetworkManagerDispatcher
/usr/sbin/NetworkManager

(Depending on the Linux distribution, we will get a different list of files, possibly an
empty list. This example is from Ubuntu). This command produces the expected result,
a list of only the files whose names begin with an uppercase letter, but:

[me@linuxbox ~]\$ ls /usr/sbin/[A-Z]*

/usr/sbin/biosdecode
/usr/sbin/chat
/usr/sbin/chgpasswd
/usr/sbin/chpasswd
/usr/sbin/chroot
/usr/sbin/cleanup-info
/usr/sbin/complain
/usr/sbin/console-kit-daemon

with this command we get an entirely different result (only a partial listing of the results
is shown). Why is that? It's a long story, but here's the short version:

Back when Unix was first developed, it only knew about ASCII characters, and this feature
reflects that fact.

In ASCII, the first 32 characters (numbers 0-31) are control codes
(things like tabs, backspaces, and carriage returns).

The next 32 (32-63) contain printable
characters, including most punctuation characters and the numerals zero through nine.

The next 32 (numbers 64-95) contain the uppercase letters and a few more punctuation
symbols.

The final 31 (numbers 96-127) contain the lowercase letters and yet more punctuation
symbols. Based on this arrangement, systems using ASCII used a collation order
that looked like this:

ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz

This differs from proper dictionary order, which is like this:

aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ

As the popularity of Unix spread beyond the United States, there grew a need to support
characters not found in U.S. English. The ASCII table was expanded to use a full eight
bits, adding characters numbers 128-255, which accommodated many more languages.

To support this ability, the POSIX standards introduced a concept called a locale, which
could be adjusted to select the character set needed for a particular location. We can see
the language setting of our system using this command:

[me@linuxbox ~]\$ echo \$LANG

en_US.UTF-8

With this setting, POSIX compliant applications will use a dictionary collation order
rather than ASCII order. This explains the behavior of the commands above. A character
range of [A-Z] when interpreted in dictionary order includes all of the alphabetic characters
except the lowercase "a", hence our results.

To partially work around this problem, the POSIX standard includes a number of character
classes which provide useful ranges of characters. They are described in the table below:

Table 19-2: POSIX Character Classes

| Character Class | Description |
| --- | --- |
| [:alnum:] | The alphanumeric characters. In ASCII, equivalent to: [A-Za-z0-9] |
| [:word:] | The same as [:alnum:], with the addition of the underscore (_) character. |
| [:alpha:] | The alphabetic characters. In ASCII, equivalent to: [A-Za-z] |
| [:blank:] | Includes the space and tab characters. |
| [:cntrl:] | The ASCII control codes. Includes the ASCII characters 0 through 31 and 127. |
| [:digit:] | The numerals zero through nine. |
| [:graph:] | The visible characters. In ASCII, it includes characters 33 through 126. |
| [:lower:] | The lowercase letters. |
| [:punct:] | The punctuation characters. In ASCII, equivalent to: [-!"#\$%&'()*+,./:;<=>?@[\\\]_`{\|}~] |
| [:print:] | The printable characters. All the characters in [:graph:] plus the space character. |
| [:space:] | The whitespace characters including space, tab, carriage return, newline, vertical tab, and form feed. In ASCII, equivalent to: [ \t\r\n\v\f] |
| [:upper:] | The uppercase characters. |
| [:xdigit:] | Characters used to express hexadecimal numbers. In ASCII, equivalent to: [0-9A-Fa-f] |

Even with the character classes, there is still no convenient way to express partial ranges,
such as [A-M].
Using character classes, we can repeat our directory listing and see an improved result:
[me@linuxbox ~]\$ ls /usr/sbin/[[:upper:]]*

/usr/sbin/MAKEFLOPPIES
/usr/sbin/NetworkManagerDispatcher
/usr/sbin/NetworkManager

Remember, however, that this is not an example of a regular expression, rather it is the
shell performing pathname expansion. We show it here because POSIX character classes
can be used for both.

You can opt to have your system use the traditional (ASCII) collation order by
changing the value of the LANG environment variable. As we saw above, the
LANG variable contains the name of the language and character set used in your
locale. This value was originally determined when you selected an installation
language as your Linux was installed.
To see the locale settings, use the locale command:

[me@linuxbox ~]\$ locale

LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

To change the locale to use the traditional Unix behaviors, set the LANG variable
to POSIX:

[me@linuxbox ~]\$ export LANG=POSIX

Note that this change converts the system to use U.S. English (more specifically,
ASCII) for its character set, so be sure if this is really what you want.
You can make this change permanent by adding this line to your .bashrc
file:

export LANG=POSIX"
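
The range behaviour described in that quoted passage can be imitated in a few lines of Python. This is only a sketch under a big assumption: the sort key below is my simplified stand-in for dictionary collation (lowercase before uppercase for each letter), whereas real locales follow far richer rules. The helper name `in_range` is hypothetical.

```python
import string

letters = string.ascii_uppercase + string.ascii_lowercase

# ASCII (code point) order: every uppercase letter sorts before any lowercase one.
ascii_order = "".join(sorted(letters))

# Simplified stand-in for dictionary collation: group by letter, lowercase first.
dictionary_order = "".join(
    sorted(letters, key=lambda c: (c.lower(), c.isupper()))
)


def in_range(order: str, first: str, last: str) -> str:
    """Return the characters lying between first and last in the given order."""
    return order[order.index(first): order.index(last) + 1]


print(ascii_order)
print(dictionary_order)
print(in_range(ascii_order, "A", "Z"))       # the uppercase letters only
print(in_range(dictionary_order, "A", "Z"))  # everything except lowercase "a"
```

The same two endpoints, "A" and "Z", enclose 26 characters under one ordering and 51 under the other, which is why the second `ls` command in the quote matched nearly every file.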

Now do you see why cutting and pasting text between systems and Apps depends on both sides using the same character maps? Each character IS a specific binary number, dependent on the map used. This is NOT what gave the problems I wrote angrily about in the WordPress "double minus sign" Post though, even if both Linux and WordPress are set to UTF-8:
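
As a small illustration of the character map point, here is a Python sketch showing the same two characters turned into different bytes by different maps. The sample text and the three encodings are my own choice for the example.

```python
text = "é£"

# The same characters become different byte values under different maps.
for encoding in ("latin-1", "cp1252", "utf-8"):
    print(encoding, text.encode(encoding).hex(" "))

# Reading UTF-8 bytes with the wrong map raises no error here;
# it silently produces different characters.
garbled = "é".encode("utf-8").decode("latin-1")
print(garbled)
```

Latin-1 and Windows-1252 happen to agree on these two characters, while UTF-8 needs two bytes for each, so a receiver assuming the wrong map shows two wrong characters where one was sent.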

Fix Double Minus Sign Problem in WordPress in Theme Functions php

I assume the WordPress "texturize" code "amends" pasted characters to different character codes (at the thoughtless discretion of the WP staff) so that they look prettier, by trying to "guess" what a writer meant to put… despite the lack of foresight regarding the effect on any programme code involved…
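
To show why that kind of substitution matters for pasted commands, here is a hedged Python sketch. The `prettify` function is a hypothetical stand-in that I wrote for illustration, assuming two hyphen-minus characters are swapped for a single EN DASH (U+2013). It is not WordPress's actual code.

```python
typed = "--help"


def prettify(text: str) -> str:
    """Hypothetical typographic filter: two hyphen-minus become one EN DASH."""
    return text.replace("--", "\u2013")


pasted = prettify(typed)

print([hex(ord(c)) for c in typed[:2]])   # two U+002D characters
print([hex(ord(c)) for c in pasted[:1]])  # one U+2013 character
print(typed.encode("utf-8").hex(" "))
print(pasted.encode("utf-8").hex(" "))
```

On screen the two strings look nearly identical, but a shell reading the pasted version receives entirely different bytes and will not recognise the option.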