
Data Universe™ Glossary



Block - A block is a variable-length sequence of binary bytes. Blocks may contain from 5 to 65,505 bytes. Blocks come in five types based on their content: User Data (UD), File Description (FD), Directory List (DL), Host List (HL), and Query (QU). The first 5 bytes are reserved for a block signature. The maximum length is selected to ensure that a Block ID and the Block itself can always be transported in a single IP Frame.

Block ID - A Block ID is a sequence of 28 to 30 printable characters. The Block ID is the printable representation of an L-Hash descriptor of the data contained within the block. Even a zero-length block will have a five-character signature, so there will always be at least 28 characters in a Block ID.

Block Signature - Five characters at the beginning of each data block that identify the type of data in the block. The first character is a literal '#', the next two characters identify the block type ('UD', 'FD', 'DL', 'HL', or 'QU'), and the final two characters identify the compression algorithm ('RD' or 'LZ'). The '#' is chosen to delimit the end of the variable-length Block ID that is prepended to the block in the data communication channel.
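The signature layout described above can be sketched in Python as follows. This is a minimal illustration, not the Universe kernel's own code: the function names `split_message` and `parse_signature` are hypothetical, and the assumption that a channel message is simply the Block ID followed directly by the block is inferred from the description of the '#' delimiter.

```python
BLOCK_TYPES = {"UD", "FD", "DL", "HL", "QU"}
COMPRESSION_CODES = {"RD", "LZ"}


def split_message(message: bytes) -> tuple[bytes, bytes]:
    """Split an assumed 'Block ID + block' message at the first '#'.

    The printable Block ID alphabet does not contain '#', so the first
    '#' in the message marks the start of the block signature.
    """
    block_id, delimiter, rest = message.partition(b"#")
    if not delimiter:
        raise ValueError("no '#' delimiter found")
    return block_id, b"#" + rest


def parse_signature(block: bytes) -> tuple[str, str]:
    """Return (block_type, compression_code) from the 5-byte signature."""
    if len(block) < 5 or block[0:1] != b"#":
        raise ValueError("block does not start with a '#' signature")
    block_type = block[1:3].decode("ascii")
    compression = block[3:5].decode("ascii")
    if block_type not in BLOCK_TYPES:
        raise ValueError(f"unknown block type: {block_type!r}")
    if compression not in COMPRESSION_CODES:
        raise ValueError(f"unknown compression code: {compression!r}")
    return block_type, compression
```

For example, splitting the message `b"abc123#UDRDhello"` yields the Block ID `b"abc123"` and the block `b"#UDRDhello"`, whose signature parses as block type 'UD' with compression code 'RD'. The Block ID in this example is a shortened placeholder, not a real L-Hash value.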

DateTime - standard printable ASCII form of the creation date and time of a file. Used to allow recreation of more complete directory entries for files extracted from the Universe. The value contains 14 decimal digits in the format: yyyymmddhhnnss.
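A minimal Python sketch of producing and reading a DateTime value is shown here. It assumes that 'nn' in the format string stands for minutes; the helper names are hypothetical.

```python
from datetime import datetime

# yyyymmddhhnnss, reading 'nn' as minutes (an assumption)
DATETIME_FORMAT = "%Y%m%d%H%M%S"


def format_datetime(moment: datetime) -> str:
    """Render a file creation time as 14 decimal digits."""
    return moment.strftime(DATETIME_FORMAT)


def parse_datetime(text: str) -> datetime:
    """Parse a 14-digit DateTime value, rejecting anything else."""
    if len(text) != 14 or not (text.isascii() and text.isdigit()):
        raise ValueError("DateTime must be exactly 14 decimal digits")
    return datetime.strptime(text, DATETIME_FORMAT)
```

A file created on February 18, 2005 at 09:30:00 would be recorded as 20050218093000.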

File ID - A File ID is a sequence of at least 27 printable characters. The File ID is the printable representation of an L-Hash descriptor of the data within the file. File IDs are used to consolidate different File Names and/or File Descriptions that describe the same content. File IDs are also used to ensure the integrity of reconstructed multi-Block files.

Host Computer - A Host is a computer that participates in the Data Universe by running the Universe kernel application. A host also makes available resources that include CPU time, Storage space, a TCP/IP socket and a certain amount of Bandwidth to the Internet.

Host ID - A Host ID in the current implementation is a URL containing an IP address and port number in printable ASCII text. See RFC 2732 for formats of literal IP addresses.
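As an illustration, a Host ID of this kind can be taken apart with Python's standard `urllib.parse` module. The `http` scheme in the examples is an assumption, since the glossary does not name a scheme, and `parse_host_id` is a hypothetical helper.

```python
from urllib.parse import urlsplit


def parse_host_id(host_id: str) -> tuple[str, int]:
    """Return (ip_address, port) from a URL-style Host ID.

    The Host ID must include a scheme (for example 'http://') so that
    the address and port are recognized as the network location.
    """
    parts = urlsplit(host_id)
    if parts.hostname is None or parts.port is None:
        raise ValueError("Host ID must contain an address and a port")
    return parts.hostname, parts.port
```

With this sketch, `parse_host_id("http://192.0.2.10:8080/")` gives the address 192.0.2.10 and port 8080, and the bracketed literal form from RFC 2732, such as `http://[2001:db8::1]:8080/`, gives the address 2001:db8::1 with the brackets removed.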

L-Hash - L-Hash is the algorithm used to create printable Block IDs and File IDs. The printable form uses 64 characters from the set ['0'..'9','A'..'Z','a'..'z','$','%'] to represent 6-bit values. Leading ASCII zero characters are suppressed to create the variable-length printable form. The recommended algorithm is L-SHA1. This implies that Block IDs and File IDs will be 27 or more printable characters in length.

L-MD5 - A modification of the RFC 1321 MD5 message digest algorithm in which the input data length in bits is prepended onto the 128-bit message digest value.

L-SHA1 - A modification of the RFC 3174 SHA1 Secure Hash algorithm in which the input data length in bits is prepended onto the 160-bit message digest value.
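One possible reading of the L-Hash and L-SHA1 entries can be sketched in Python. The sketch assumes that "prepended" means the bit length occupies the high-order bits above the 160-bit digest, that the combined value is treated as one unsigned integer, and that the 64 characters are assigned to the values 0 through 63 in the order listed. None of these details is spelled out in the glossary, and the function names are hypothetical.

```python
import hashlib
import string

# '0'..'9', 'A'..'Z', 'a'..'z', '$', '%' mapped to 0..63 (assumed order)
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase + "$%"


def l_sha1(data: bytes) -> int:
    """Return the SHA-1 digest with the input length in bits placed above it."""
    digest = int.from_bytes(hashlib.sha1(data).digest(), "big")
    return ((len(data) * 8) << 160) | digest


def to_printable(value: int) -> str:
    """Encode a non-negative integer in 6-bit groups, most significant first.

    Building the string from the low end upward never emits leading '0'
    characters, which matches the suppression rule in the glossary.
    A value of zero therefore encodes as the empty string.
    """
    chars = []
    while value > 0:
        value, digit = divmod(value, 64)
        chars.append(ALPHABET[digit])
    return "".join(reversed(chars))


def l_hash(data: bytes) -> str:
    """Printable L-Hash of data, using L-SHA1 as the underlying algorithm."""
    return to_printable(l_sha1(data))
```

Under these assumptions the lengths agree with the glossary. A block holding only its 5-byte signature is 40 bits long, which adds 6 bits above the digest for 166 bits in total, or 28 characters. A full 65,505-byte block is 524,040 bits long, which adds 19 bits for 179 bits in total, or 30 characters.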

Query - A type of data Block that contains instructions for searching the Repository of one or more Hosts and returning the results. The Data Universe Idle Process scans the Repository looking for Query Blocks. As they are found, they are processed and disposed of either by (1) returning results, (2) forwarding to another Host, or (3) discarding. A list of recent Queries prevents the same Query from running more than once on a given Host.
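The list of recent Queries can be pictured as a small bounded set. The Python sketch below assumes that Queries are recognized by their Block ID and that the oldest entries are dropped once a fixed capacity is reached; both choices, along with the class name and the default capacity, are hypothetical.

```python
from collections import OrderedDict


class RecentQueries:
    """Remember recently seen Query Block IDs, oldest evicted first."""

    def __init__(self, capacity: int = 1000):
        self._capacity = capacity
        self._seen = OrderedDict()

    def first_sighting(self, block_id: str) -> bool:
        """Return True if the Query has not been seen recently, and record it."""
        if block_id in self._seen:
            return False
        self._seen[block_id] = None
        if len(self._seen) > self._capacity:
            # Drop the oldest remembered Query to stay within capacity.
            self._seen.popitem(last=False)
        return True
```

An Idle Process built along these lines would process a Query Block only when `first_sighting` returns True and discard it otherwise.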

Repository - The storage area on a Host computer that contains data Blocks. A configuration parameter allows the administrator of each Host to limit the size of the Repository. Simple implementations may store each block in a separate disk file using the signature and L-Hash as the file name. (Windows implementations may be restricted by their inability to differentiate upper- and lower-case filenames.)

Slicing - A general term that means breaking up an arbitrarily large data file into a set of one or more User Data (UD) Blocks. Typically, the first step is to run a compression algorithm. The results are then divided into Blocks that do not exceed 65,500 bytes. The size of the blocks and whether they contain consecutive data (or are broken into stripes) are arbitrary decisions made at the time the file is entered into the Universe. Additional blocks may be created to implement error correction logic. These may be simple parity-based blocks or they may incorporate Reed-Solomon Forward Error Correction coding. The goal is to allow complete and accurate reassembly of the original file, even when some of the data blocks are missing. The reconstruction instructions (including the list of Block IDs, slicing/striping/ECC, and decompression algorithm) are included in the File Description (FD) Block.
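The compress-then-divide steps can be sketched in Python as follows. This is only an illustration of consecutive slicing: zlib stands in for the unspecified 'LZ' algorithm, the 65,500-byte limit is applied to the data that follows the 5-byte signature, and striping and error correction blocks are left out. The function names are hypothetical.

```python
import zlib

MAX_PAYLOAD = 65_500      # data bytes per block, from the glossary
SIGNATURE = b"#UDLZ"      # User Data block; zlib used as a stand-in for 'LZ'


def slice_file(data: bytes, payload_size: int = MAX_PAYLOAD) -> list[bytes]:
    """Compress data and cut it into consecutive User Data blocks."""
    if not 1 <= payload_size <= MAX_PAYLOAD:
        raise ValueError("payload_size must be between 1 and 65,500")
    compressed = zlib.compress(data)
    return [
        SIGNATURE + compressed[start:start + payload_size]
        for start in range(0, len(compressed), payload_size)
    ]


def reassemble(blocks: list[bytes]) -> bytes:
    """Rebuild the original file from its blocks, given in order."""
    compressed = b"".join(block[len(SIGNATURE):] for block in blocks)
    return zlib.decompress(compressed)
```

Because zlib output is never empty, even a zero-length file yields one block, which fits the "one or more" wording. In a fuller implementation the order of the blocks would come from the list of Block IDs in the File Description (FD) Block rather than from the order of a Python list.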

Timestamp - standard printable ASCII form for time-of-day values used in File Descriptions, Directory Lists, Host Lists and Queries. The value contains exactly 12 decimal digits representing Universal Time in the format: yymmddhhnnss. Since the timestamp is predominately used for expiration times and sorting, simple string comparisons will suffice in most instances.
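The string-comparison property can be demonstrated with a short Python sketch. As with DateTime, 'nn' is read as minutes, which is an assumption, and the helper names are hypothetical.

```python
from datetime import datetime, timezone

# yymmddhhnnss in Universal Time, reading 'nn' as minutes (an assumption)
TIMESTAMP_FORMAT = "%y%m%d%H%M%S"


def make_timestamp(moment: datetime) -> str:
    """Render a timezone-aware datetime as a 12-digit UTC Timestamp."""
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    return moment.astimezone(timezone.utc).strftime(TIMESTAMP_FORMAT)


def is_expired(expires: str, now: str) -> bool:
    """Compare two Timestamps as plain strings."""
    return now >= expires
```

Because every field has a fixed width and the most significant field comes first, string order matches time order within a century. The two-digit year is one case where this breaks down: the string "991231235959" sorts after "000101000000" whichever century each year belongs to, which may be among the exceptions implied by "in most instances".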




Contact Us

BKMcM.com • 13261 Ridgepointe Rd. • Keller, TX 76248

Phone 214-232-3198


Created on ... February 18, 2005
© Copyright 2005 Brian McMillin