Resources, Assignments and Clarifications.....: UNIT 5 Lecture Notes

HASHING
Hashing is the transformation of a string of characters into a usually shorter fixed-length value or key that represents the original string. Hashing is used to index and retrieve items in a database because it is faster to find the item using the shorter hashed key than to find it using the original value. It is also used in many encryption algorithms.
As a simple example of the use of hashing in databases, a group of people could be arranged in a database like this:
Abernathy, Sara
Epperdingle, Roscoe
Moore, Wilfred
Smith, David
(and many more sorted into alphabetical order)
Each of these names would be the key in the database for that person's data. A database search mechanism would first have to start looking character-by-character across the name for matches until it found the match (or ruled the other entries out). But if each of the names were hashed, it might be possible (depending on the number of names in the database) to generate a unique four-digit key for each name. For example:
7864 Abernathy, Sara
9802 Epperdingle, Roscoe
1990 Moore, Wilfred
8822 Smith, David
(and so forth)
A search for any name would first consist of computing the hash value (using the same hash function used to store the item) and then comparing for a match using that value. It would, in general, be much faster to find a match across four digits, each having only 10 possibilities, than across an unpredictable value length where each character had 26 possibilities.
The hashing algorithm is called the hash function (the term probably derives from the idea that the resulting hash value can be thought of as a "mixed up" version of the represented value). In addition to faster data retrieval, hashing is also used in creating and verifying digital signatures (used to authenticate message senders and receivers). The sender transforms the message with the hash function to produce a hashed value (known as a message-digest), signs the digest, and sends the message and the signed digest to the receiver. Using the same hash function as the sender, the receiver derives a message-digest from the received message and compares it with the message-digest recovered from the signature. They should be the same.
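The digest comparison at the end of that exchange can be sketched as follows. This is a minimal illustration only: the notes do not name a hash function, so SHA-256 from Python's standard `hashlib` module is an assumption, and the names `message_digest` and `verify` are hypothetical. A real signature scheme would also protect the digest with the sender's private key, which is left out here.

```python
import hashlib


def message_digest(message: bytes) -> str:
    """Return the SHA-256 digest of the message as a hex string."""
    return hashlib.sha256(message).hexdigest()


def verify(received: bytes, received_digest: str) -> bool:
    """Recompute the digest with the same hash function and compare."""
    return message_digest(received) == received_digest


# Sender side: compute the digest that accompanies the message.
message = b"Transfer 100 units to account 42"
sent_digest = message_digest(message)

# Receiver side: an unmodified message matches, a tampered one does not.
unmodified_ok = verify(message, sent_digest)
tampered_ok = verify(b"Transfer 900 units to account 42", sent_digest)
```

Changing even one character of the message gives a different digest, which is why the comparison detects tampering.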
The hash function is used to index the original value or key and then used later each time the data associated with the value or key is to be retrieved. Thus, hashing is always a one-way operation. There's no need to "reverse engineer" the hash function by analyzing the hashed values. In fact, the ideal hash function can't be derived by such analysis. A good hash function also should not produce the same hash value from two different inputs. If it does, this is known as a collision. A hash function that offers an extremely low risk of collision may be considered acceptable.
Here are some relatively simple hash functions that have been used:
· The division-remainder method: The number of items in the table is estimated. That number is then used as a divisor into each original value or key to extract a quotient and a remainder. The remainder is the hashed value. (Since this method is liable to produce a number of collisions, any search mechanism would have to be able to recognize a collision and offer an alternate search mechanism.)
· Folding: This method divides the original value (digits in this case) into several parts, adds the parts together, and then uses the last four digits (or some other arbitrary number of digits that will work) as the hashed value or key.
· Radix transformation: Where the value or key is digital, the number base (or radix) can be changed resulting in a different sequence of digits. (For example, a decimal numbered key could be transformed into a hexadecimal numbered key.) High-order digits could be discarded to fit a hash value of uniform length.
· Digit rearrangement: This is simply taking part of the original value or key such as digits in positions 3 through 6, reversing their order, and then using that sequence of digits as the hash value or key.
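The four methods above can be sketched in a few lines each. This is an illustrative sketch, not a reference implementation: the function names, the part length of four digits, the table size of 97, and the sample key 123456789 are all assumptions chosen for the example.

```python
def division_remainder(key: int, table_size: int) -> int:
    """Use the remainder of key / table_size as the hashed value."""
    return key % table_size


def folding(key: int, part_len: int = 4, keep: int = 4) -> int:
    """Split the digits into parts, add the parts, keep the last `keep` digits."""
    digits = str(key)
    parts = [int(digits[i:i + part_len]) for i in range(0, len(digits), part_len)]
    return sum(parts) % 10 ** keep


def radix_transformation(key: int, keep: int = 4) -> str:
    """Rewrite a decimal key in hexadecimal and discard high-order digits."""
    return format(key, "x")[-keep:]


def digit_rearrangement(key: int, start: int = 3, end: int = 6) -> int:
    """Take the digits in positions start..end (1-based) and reverse them."""
    digits = str(key)
    return int(digits[start - 1:end][::-1])


key = 123456789
print(division_remainder(key, 97))  # 39
print(folding(key))                 # 1234 + 5678 + 9 = 6921
print(radix_transformation(key))    # 123456789 is 0x75bcd15, so "cd15"
print(digit_rearrangement(key))     # digits 3..6 are "3456", reversed 6543
```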
A hash function that works well for database storage and retrieval might not work as well for cryptographic or error-checking purposes. There are several well-known hash functions used in cryptography.

HASH FUNCTION
A hash table, or hash map, is a data structure that associates keys with values. The primary operation it supports efficiently is a lookup: given a key (e.g. a person's name), find the corresponding value (e.g. that person's telephone number). It works by transforming the key using a hash function into a hash, a number that is used as an index in an array to locate the desired location ("bucket") where the value should be.
Hash tables support the efficient insertion of new entries, and the time spent searching for the required data is, on average, independent of the number of items stored.
Hash tables are often used to implement associative arrays, sets and caches. Like arrays, hash tables provide constant-time O(1) lookup on average, regardless of the number of items in the table. While theoretically the worst-case lookup time can be as bad as O(n), this is for practical purposes statistically impossible unless the hash function is poorly designed or unless the set of keys is maliciously chosen with the given hash function in mind.
Compared to other associative array data structures, hash tables are most useful when large numbers of records are to be stored, especially if the size of the data set can be predicted.
Hash tables may be used as in-memory data structures. Hash tables may also be adopted for use with persistent data structures; database indexes sometimes use disk-based data structures based on hash tables, although balanced trees are more popular.

Choosing a good hash function
A good hash function is essential for good hash table performance. A poor choice of hash function is likely to lead to clustering, in which the probability of keys mapping to the same hash bucket (i.e. a collision) is significantly greater than would be expected from a random function. A nonzero collision probability is inevitable in any hash implementation, but usually the number of operations required to resolve a collision scales linearly with the number of keys mapping to the same bucket, so excess collisions will degrade performance significantly. In addition, some hash functions are computationally expensive, so the amount of time (and, in some cases, memory) taken to compute the hash may be burdensome.
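A small experiment can make clustering concrete. The sketch below is an assumed scenario, not one from the notes: it hashes 100 keys that are all multiples of 10 with the division-remainder method, once with a table size of 10 and once with a table size of 7, and counts how many keys land in each bucket. The helper name `bucket_counts` is hypothetical.

```python
from collections import Counter


def bucket_counts(keys, table_size):
    """Count how many keys land in each bucket under key % table_size."""
    return Counter(key % table_size for key in keys)


keys = range(0, 1000, 10)            # 100 keys, all multiples of 10

clustered = bucket_counts(keys, 10)  # table size shares a factor with every key
spread = bucket_counts(keys, 7)      # 7 has no common factor with 10

print(dict(clustered))               # {0: 100}: every key collides in bucket 0
print(len(spread), max(spread.values()))  # 7 buckets used, at most 15 keys each
```

With a table size of 10, every lookup has to search through all 100 keys in one bucket. With a table size of 7, the same keys spread almost evenly across the buckets.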
Collision resolution
If two keys hash to the same index, the corresponding records cannot be stored in the same location. So, if it's already occupied, we must find another location to store the new record, and do it so that we can find it when we look it up later on. This is known as collision resolution.
There are a number of collision resolution techniques, but the most popular are chaining and open addressing.
Separate chaining
Sometimes called simply chaining, this technique in its simplest form has each slot in the array reference a linked list of inserted records. Each linked list holds every element that hashes to the same slot. Insertion requires finding the correct slot and appending to either end of the list in that slot; deletion requires searching the list and removing the record.
Chained hash tables have advantages over open addressed hash tables in that the removal operation is simple and resizing the table can be postponed for a much longer time because performance degrades more gracefully even when every slot is used. Indeed, many chaining hash tables may not require resizing at all since performance degradation is linear as the table fills. For example, a chaining hash table containing twice its recommended capacity of data would only be about twice as slow on average as the same table at its recommended capacity.
Chained hash tables inherit the disadvantages of linked lists. When storing small records, the overhead of the linked list can be significant. An additional disadvantage is that traversing a linked list has poor cache performance.
Alternative data structures can be used for chains instead of linked lists. By using a self-balancing tree, for example, the theoretical worst-case time of a hash table can be brought down to O(log n) rather than O(n). However, since each list is intended to be short, this approach is usually inefficient unless the hash table is designed to run at full capacity or there are unusually high collision rates, as might occur in input designed to cause collisions. Dynamic arrays can also be used to decrease space overhead and improve cache performance when records are small.
Some chaining implementations use an optimization where the first record of each chain is stored in the table. The purpose is to increase cache efficiency of hash table access.
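A minimal sketch of separate chaining follows. It rests on a few assumptions: the class name `ChainedHashTable` is hypothetical, Python's built-in `hash` stands in for the hash function, and each chain is a Python list (a dynamic array) and not a linked list, which is one of the alternatives mentioned above.

```python
class ChainedHashTable:
    """Hash table that resolves collisions by keeping a chain per slot."""

    def __init__(self, num_slots: int = 8):
        self.slots = [[] for _ in range(num_slots)]

    def _chain(self, key):
        # Find the correct slot for this key.
        return self.slots[hash(key) % len(self.slots)]

    def insert(self, key, value):
        chain = self._chain(key)
        for i, (existing_key, _) in enumerate(chain):
            if existing_key == key:      # key already present: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))       # otherwise append to the chain

    def lookup(self, key):
        for existing_key, value in self._chain(key):
            if existing_key == key:
                return value
        raise KeyError(key)

    def delete(self, key):
        chain = self._chain(key)
        for i, (existing_key, _) in enumerate(chain):
            if existing_key == key:      # search the chain, then remove
                del chain[i]
                return
        raise KeyError(key)


table = ChainedHashTable(num_slots=2)    # two slots force collisions
for number, name in enumerate(["Abernathy", "Epperdingle", "Moore", "Smith"]):
    table.insert(name, number)
print(table.lookup("Moore"))             # 2
```

The table keeps working with four records in two slots. The chains just get longer, which is the graceful degradation described above.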

Open Addressing
Open addressing, or closed hashing, is a method of collision resolution in hash tables. With this method a hash collision is resolved by probing, or searching through alternate locations in the array (the probe sequence) until either the target record is found, or an unused array slot is found, which indicates that there is no such key in the table. Well known probe sequences include:
linear probing
in which the interval between probes is fixed--often at 1.
quadratic probing
in which the interval between probes increases linearly (hence, the indices are described by a quadratic function).
double hashing
in which the interval between probes is fixed for each record but is computed by another hash function.
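The three probe sequences can be written as generators that yield slot indices in the order they would be tried. This is a sketch under stated assumptions: keys are integers, the home slot is `key % size`, quadratic probing uses triangular-number offsets (which visit every slot when the size is a power of two), and the second hash function for double hashing is `1 + key % (size - 1)`. None of these specific choices comes from the notes.

```python
def linear_probe(key: int, size: int):
    """Fixed interval of 1 between probes."""
    home = key % size
    for i in range(size):
        yield (home + i) % size


def quadratic_probe(key: int, size: int):
    """Interval grows by one each step: offsets 0, 1, 3, 6, 10, ..."""
    home = key % size
    for i in range(size):
        yield (home + i * (i + 1) // 2) % size


def double_hash_probe(key: int, size: int):
    """Interval is fixed per key but computed by a second hash function."""
    home = key % size
    step = 1 + key % (size - 1)   # never zero, so the probe always advances
    for i in range(size):
        yield (home + i * step) % size


print(list(linear_probe(10, 8)))       # [2, 3, 4, 5, 6, 7, 0, 1]
print(list(quadratic_probe(10, 8)))    # [2, 3, 5, 0, 4, 1, 7, 6]
print(list(double_hash_probe(10, 7)))  # [3, 1, 6, 4, 2, 0, 5]
```

The double hashing example uses a prime table size of 7 so that any step value visits every slot.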
The main tradeoffs between these methods are that linear probing has the best cache performance but is most sensitive to clustering, while double hashing has poor cache performance but exhibits virtually no clustering; quadratic probing falls in between in both areas. Double hashing can also require more computation than other forms of probing. Some open addressing methods, such as last-come-first-served hashing and cuckoo hashing, move existing keys around in the array to make room for the new key. This gives better maximum search times than the methods based on probing.
A critical influence on performance of an open addressing hash table is the load factor; that is, the proportion of the slots in the array that are used. As the load factor increases towards 100%, the number of probes that may be required to find or insert a given key rises dramatically. Once the table becomes full, probing algorithms may even fail to terminate. Even with good hash functions, load factors are normally limited to 80%. A poor hash function can exhibit poor performance even at very low load factors by generating significant clustering. What causes hash functions to cluster is not well understood, and it is easy to unintentionally write a hash function which causes severe clustering.
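The sketch below shows an open addressing table with linear probing that refuses inserts once they would push the load factor past a limit. The class name `LinearProbingTable`, the default size, and the choice to raise an error (a real table would more likely resize) are assumptions for illustration. Deletion is left out because it needs extra bookkeeping that the notes do not cover.

```python
class LinearProbingTable:
    """Open addressing with linear probing and a load factor limit."""

    _EMPTY = object()   # marker for an unused slot

    def __init__(self, size: int = 8, max_load: float = 0.8):
        self.slots = [self._EMPTY] * size
        self.count = 0
        self.max_load = max_load

    def load_factor(self) -> float:
        return self.count / len(self.slots)

    def _find(self, key):
        """Probe from the home slot; return (index, found)."""
        size = len(self.slots)
        home = hash(key) % size
        for i in range(size):
            index = (home + i) % size
            entry = self.slots[index]
            if entry is self._EMPTY:     # unused slot: key is not in the table
                return index, False
            if entry[0] == key:
                return index, True
        raise RuntimeError("table is full")  # guard against endless probing

    def insert(self, key, value):
        index, found = self._find(key)
        if not found:
            if (self.count + 1) / len(self.slots) > self.max_load:
                raise RuntimeError("load factor limit reached")
            self.count += 1
        self.slots[index] = (key, value)

    def lookup(self, key):
        index, found = self._find(key)
        if not found:
            raise KeyError(key)
        return self.slots[index][1]


table = LinearProbingTable(size=10, max_load=0.8)
for key in range(0, 80, 10):      # 0, 10, ..., 70 all share home slot 0
    table.insert(key, str(key))
print(table.lookup(70))           # "70", found after probing slots 0 to 7
print(table.load_factor())        # 0.8
```

Because the limit is below 100%, at least one slot is always unused, so a probe for a missing key is guaranteed to stop.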

Open addressing versus chaining
Chained hash tables have the following benefits over open addressing:
· They are simple to implement effectively and only require basic data structures.
· From the point of view of writing suitable hash functions, chained hash tables are insensitive to clustering, only requiring minimization of collisions. Open addressing depends upon better hash functions to avoid clustering. This is particularly important if novice programmers can add their own hash functions, but even experienced programmers can be caught out by unexpected clustering effects.
· They degrade in performance more gracefully. Although chains grow longer as the table fills, a chained hash table cannot "fill up" and does not exhibit the sudden increases in lookup times that occur in a near-full table with open addressing.
· If the hash table stores large records, about 5 or more words per record, chaining uses less memory than open addressing.
· If the hash table is sparse (that is, it has a big array with many free array slots), chaining uses less memory than open addressing even for small records of 2 to 4 words per record due to its external storage.
For small record sizes (a few words or less) the benefits of open addressing compared to chaining are:
· They can be more space-efficient than chaining since they don't need to store any pointers or allocate any additional space outside the hash table. Simple linked lists require a word of overhead per element.
· Insertions avoid the time overhead of memory allocation, and can even be implemented in the absence of a memory allocator.
· Because it uses internal storage, open addressing avoids the extra indirection required for chaining's external storage. It also has better locality of reference, particularly with linear probing. With small record sizes, these factors can yield better performance than chaining, particularly for lookups.
· They can be easier to serialize, because they don't use pointers.
On the other hand, normal open addressing is a poor choice for large elements, since these elements fill entire cache lines (negating the cache advantage), and a large amount of space is wasted on large empty table slots. If the open addressing table only stores references to elements (external storage), it uses space comparable to chaining even for large records but loses its speed advantage.
Generally speaking, open addressing is better used for hash tables with small records that can be stored within the table (internal storage) and fit in a cache line. It is particularly suitable for elements of one word or less. In cases where the tables are expected to have high load factors, the records are large, or the data is variable-sized, chained hash tables often perform as well or better.
Ultimately, used sensibly, any kind of hash table algorithm is usually fast enough, and the percentage of a calculation spent in hash table code is low. Memory usage is rarely considered excessive. Therefore, in most cases the differences between these algorithms are marginal, and other considerations typically come into play.

GARBAGE COLLECTION
Garbage collection (GC) is a form of automatic memory management. The garbage collector, or just collector, attempts to reclaim garbage, or memory used by objects that will never be accessed or mutated again by the application. Garbage collection was invented by John McCarthy around 1959 to solve the problems of manual memory management in his Lisp programming language.
Garbage collection is often portrayed as the opposite of manual memory management, which requires the programmer to specify which objects to deallocate and return to the memory system. However, many systems use a combination of the two approaches, and there are other techniques being studied to solve the same fundamental problem.
The basic principle of how a garbage collector works is:
1. Determine what data objects in a program will not be accessed in the future
2. Reclaim the resources used by those objects
By making manual memory deallocation unnecessary (and typically forbidding it), garbage collection frees the programmer from having to worry about releasing objects that are no longer needed, which can otherwise consume a significant amount of design effort. It also aids programmers in their efforts to make programs more stable, because it prevents several classes of runtime errors. For example, it prevents dangling pointer errors, where a reference to a deallocated object is used. (The pointer still points to the location in memory where the object or data was, even though the object or data has since been deleted and the memory may now be used for other purposes, creating a dangling pointer.) Many computer languages require garbage collection, either as part of the language specification (e.g. C#, and most scripting languages) or effectively for practical implementation (e.g. formal languages like lambda calculus); these are said to be garbage-collected languages. Other languages were designed for use with manual memory management, but have garbage collected implementations (e.g., C, C++).
Benefits
Garbage collection frees the programmer from manually dealing with memory allocation and deallocation. As a result, certain categories of bugs are eliminated or substantially reduced:
· Dangling pointer bugs, which occur when a piece of memory is freed while there are still pointers to it, and one of those pointers is used.
· Double free bugs, which occur when the program attempts to free a region of memory that is already free.
· Certain kinds of memory leaks, in which a program fails to free memory that is no longer referenced by any variable, leading over time to memory exhaustion.
Researchers draw a distinction between "physical" and "logical" memory leaks. In a physical memory leak, the last pointer to a region of allocated memory is removed, but the memory is not freed. In a logical memory leak, a region of memory is still referenced by a pointer, but is never actually used. Garbage collectors generally can do nothing about logical memory leaks. Novice programmers sometimes believe that garbage collection makes memory leaks impossible, not realizing that logical leaks are still possible.
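A logical leak can be observed in a garbage-collected language. The sketch below uses Python's standard `gc` and `weakref` modules. The `Record` class and the `cache` list are hypothetical stand-ins, and the behaviour shown assumes the reference CPython implementation, where an object is reclaimed as soon as nothing refers to it.

```python
import gc
import weakref


class Record:
    """Stand-in for some application object."""


cache = []                      # long-lived container that is never read again

record = Record()
probe = weakref.ref(record)     # lets us observe whether the object is alive
cache.append(record)

del record                      # the program no longer uses the object...
gc.collect()
still_alive = probe() is not None   # ...but the cache still references it

cache.clear()                   # dropping the last reference
gc.collect()
reclaimed = probe() is None     # now the collector can reclaim it

print(still_alive, reclaimed)   # True True
```

While the forgotten `cache` holds the reference, the collector has to treat the object as live even though the program will never use it again.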

Memory management
Memory management is the act of managing computer memory. In its simpler forms, this involves providing ways to allocate portions of memory to programs at their request, and freeing it for reuse when no longer needed.
Virtual memory systems separate the memory addresses used by a process from actual physical addresses, allowing separation of processes and increasing the effectively available amount of RAM using disk swapping. The quality of the virtual memory manager can have a big impact on overall system performance.
Garbage collection is the automated allocation and deallocation of computer memory resources for a program. This is generally implemented at the programming language level and is in opposition to manual memory management, the explicit allocation and deallocation of computer memory resources.
Memory management systems usually deal with the following issues.
Relocation
In systems with virtual memory, programs in memory must be able to reside in different parts of the memory at different times. This is because when the program is swapped back into memory after being swapped out for a while it cannot always be placed in the same location. Memory management in the operating system should therefore be able to relocate programs in memory and handle memory references in the code of the program so that they always point to the right location in memory.
Protection
Processes should not be able to reference the memory for another process without permission. This is called memory protection, and prevents malicious or malfunctioning code in one program from interfering with the operation of other running programs.
Sharing
Even though the memory of different processes is protected from other processes, different processes should be able to share information and therefore access the same part of memory.
Logical organization
Programs are often organized in modules. Some of these modules could be shared between different programs, some are read only and some contain data that can be modified. The memory management is responsible for handling this logical organization that is different from the physical linear address space. One way to arrange this organization is segmentation.
Physical organization
Memory is usually divided into fast primary storage and slow secondary storage. Memory management in the operating system handles moving information between these two levels of memory.
Compaction
Compaction is the process of moving live objects to eliminate dead space between them. Some people call this compactifying, to distinguish it from techniques for compressing data structures. Compaction is used to avoid external fragmentation and to increase locality of reference. It attacks the problem of fragmentation by moving all the allocated blocks to one end of memory, thus combining all the holes. Aside from the obvious cost of all that copying, there is an important limitation to compaction: any pointers to a block need to be updated when the block is moved. Unless it is possible to find all such pointers, compaction is not possible. Pointers can be stored in the allocated blocks themselves as well as in other places in the client of the memory manager. In some situations, pointers can point not only to the start of blocks but also into their bodies. For example, if a block contains executable code, a branch instruction might be a pointer to another location in the same block. Compaction is performed in three phases. First, the new location of each block is calculated to determine the distance the block will be moved. Then each pointer is updated by adding to it the amount that the block it is pointing (in)to will be moved. Finally, the data is actually moved.
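The three phases can be sketched on a toy heap. Everything about the model is an assumption made for illustration: the heap is a Python list of cells with `None` for free space, live blocks are given as `(start, size)` pairs, and all pointers are supplied as one list of addresses. Pointers stored inside the blocks themselves are not modelled.

```python
def compact(heap, blocks, pointers):
    """Slide live blocks to the low end of the heap.

    Returns (new_heap, new_blocks, new_pointers).
    """
    # Phase 1: calculate the new location of each block.
    moves = []                       # (old_start, size, new_start)
    next_free = 0
    for start, size in sorted(blocks):
        moves.append((start, size, next_free))
        next_free += size

    # Phase 2: adjust each pointer by the distance its block will move.
    def relocate(address):
        for old_start, size, new_start in moves:
            if old_start <= address < old_start + size:
                return address - (old_start - new_start)
        raise ValueError(f"pointer {address} is not inside a live block")

    new_pointers = [relocate(p) for p in pointers]

    # Phase 3: move the data (into a fresh list, to keep the sketch simple).
    new_heap = [None] * len(heap)
    for old_start, size, new_start in moves:
        new_heap[new_start:new_start + size] = heap[old_start:old_start + size]

    new_blocks = [(new_start, size) for _, size, new_start in moves]
    return new_heap, new_blocks, new_pointers


heap = ["a0", "a1", None, None, "b0", "b1", "b2", None, "c0", None]
blocks = [(0, 2), (4, 3), (8, 1)]
pointers = [0, 5, 8]                 # address 5 points into the body of block b

new_heap, new_blocks, new_pointers = compact(heap, blocks, pointers)
print(new_heap)      # ['a0', 'a1', 'b0', 'b1', 'b2', 'c0', None, None, None, None]
print(new_pointers)  # [0, 3, 5]
```

After compaction the pointer that referred to `b1` at address 5 refers to it at address 3, and all the free cells form one hole at the end.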

DEPTH FIRST SEARCH (DFS)
Depth-first search (DFS) is an algorithm for traversing or searching a tree, tree structure, or graph. Intuitively, one starts at the root (selecting some node as the root in the graph case) and explores as far as possible along each branch before backtracking.
Formally, DFS is an uninformed search that progresses by expanding the first child node of the search tree that appears and thus going deeper and deeper until a goal node is found, or until it hits a node that has no children. Then the search backtracks, returning to the most recent node it hadn't finished exploring. In a non-recursive implementation, all freshly expanded nodes are added to a LIFO stack for exploration.
The space complexity of DFS is typically much lower than that of BFS (breadth-first search). DFS also lends itself much better to heuristic methods of choosing a likely-looking branch. The time complexity of both algorithms is proportional to the number of vertices plus the number of edges in the graphs they traverse. When searching large graphs that cannot be fully contained in memory, DFS suffers from non-termination when the length of a path in the search tree is infinite. The simple solution of "remember which nodes I have already seen" doesn't always work because there can be insufficient memory. This can be solved by maintaining an increasing limit on the depth of the tree, which is called iterative deepening depth-first search.
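Iterative deepening can be sketched as a depth-limited DFS that is rerun with a growing limit. The graph representation (a dictionary mapping each node to a list of neighbors), the function names, and the `max_depth` cutoff are assumptions for this example. The sketch remembers only the nodes on the current path, not every node it has seen.

```python
def depth_limited(graph, node, goal, limit, path):
    """DFS from node that gives up once the depth limit is used up."""
    if node == goal:
        return path
    if limit == 0:
        return None
    for neighbor in graph.get(node, []):
        if neighbor in path:          # do not loop back along the current path
            continue
        found = depth_limited(graph, neighbor, goal, limit - 1, path + [neighbor])
        if found is not None:
            return found
    return None


def iterative_deepening(graph, start, goal, max_depth):
    """Run depth-limited DFS with limits 0, 1, 2, ... up to max_depth."""
    for limit in range(max_depth + 1):
        found = depth_limited(graph, start, goal, limit, [start])
        if found is not None:
            return found
    return None


graph = {
    "A": ["B", "C"],
    "B": ["D", "A"],   # the edge back to A makes a cycle
    "C": ["E"],
    "D": ["F"],
    "E": ["F"],
    "F": [],
}
print(iterative_deepening(graph, "A", "F", max_depth=5))  # ['A', 'B', 'D', 'F']
```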
Algorithm for DFS
i. Declare two empty lists: Open and Closed.
ii. Add the Start node to the Open list.
iii. While the Open list is not empty, loop the following:
   a. Remove the first node from the Open list.
   b. Check to see if the removed node is our destination.
      i. If the removed node is our destination, add the node to the Closed list, break out of the loop, and return the Closed list.
      ii. If the removed node is not our destination, continue the loop (go to Step c).
   c. Extract the neighbors of the removed node.
   d. Add the neighbors to the beginning of the Open list, and add the removed node to the Closed list. Continue looping.
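The steps above translate fairly directly into code. In this sketch the graph representation (a dictionary mapping each node to a list of neighbors) and the function name are assumptions. The sketch also ignores nodes that are already on the Closed list, which keeps it from looping forever on graphs with cycles.

```python
def depth_first_search(graph, start, goal):
    """Return the Closed list (nodes in visit order), or None if goal is not found."""
    open_list = [start]                   # steps i and ii
    closed_list = []
    while open_list:                      # step iii
        node = open_list.pop(0)           # a. remove the first node
        if node in closed_list:           # already explored via another route
            continue
        if node == goal:                  # b. destination reached
            closed_list.append(node)
            return closed_list
        # c. extract the neighbors that have not been explored yet
        neighbors = [n for n in graph.get(node, []) if n not in closed_list]
        # d. neighbors go to the beginning of Open, so they are explored next
        open_list = neighbors + open_list
        closed_list.append(node)
    return None


graph = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F"],
    "D": [],
    "E": ["F"],
    "F": [],
}
print(depth_first_search(graph, "A", "F"))  # ['A', 'B', 'D', 'E', 'F']
```

Note that the returned Closed list is the order in which nodes were visited, not the path from the start node to the destination.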
BREADTH FIRST SEARCH (BFS)
Breadth-first search (BFS) is a graph search algorithm that begins at the root node and explores all the neighbouring nodes. Then for each of those nearest nodes, it explores their unexplored neighbour nodes, and so on, until it finds the goal. BFS is an uninformed search method that aims to expand and examine all nodes of a graph systematically in search of a solution. In other words, it exhaustively searches the entire graph without considering the goal until it finds it. It does not use a heuristic.
From the standpoint of the algorithm, all child nodes obtained by expanding a node are added to a FIFO queue. In typical implementations, nodes that have not yet been examined for their neighbors are placed in some container (such as a queue or linked list) called "open" and then once examined are placed in the container "closed".
Algorithm for BFS
i. Declare two empty lists: Open and Closed.
ii. Add the Start node to the Open list.
iii. While the Open list is not empty, loop the following:
   a. Remove the first node from the Open list.
   b. Check to see if the removed node is our destination.
      i. If the removed node is our destination, add the node to the Closed list, break out of the loop, and return the Closed list.
      ii. If the removed node is not our destination, continue the loop (go to Step c).
   c. Extract the neighbors of the removed node.
   d. Add the neighbors to the end of the Open list, and add the removed node to the Closed list. Continue looping.
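The BFS steps differ from DFS only in where the neighbors are added, so the sketch is almost identical. The graph representation (a dictionary mapping each node to a list of neighbors) and the function name are assumptions, and `collections.deque` is used as the FIFO queue for the Open list. As a safeguard for graphs with cycles, nodes already on the Closed list are ignored.

```python
from collections import deque


def breadth_first_search(graph, start, goal):
    """Return the Closed list (nodes in visit order), or None if goal is not found."""
    open_list = deque([start])            # steps i and ii
    closed_list = []
    while open_list:                      # step iii
        node = open_list.popleft()        # a. remove the first node
        if node in closed_list:           # already explored via another route
            continue
        if node == goal:                  # b. destination reached
            closed_list.append(node)
            return closed_list
        # c. extract the neighbors that have not been explored yet
        neighbors = [n for n in graph.get(node, []) if n not in closed_list]
        # d. neighbors go to the end of Open, so nearer nodes are explored first
        open_list.extend(neighbors)
        closed_list.append(node)
    return None


graph = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F"],
    "D": [],
    "E": ["F"],
    "F": [],
}
print(breadth_first_search(graph, "A", "F"))  # ['A', 'B', 'C', 'D', 'E', 'F']
```

On this graph BFS visits both children of A before going deeper, whereas a depth-first search would follow the branch through B to its end first.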

Thursday, December 17, 2009
