
Module datastore


Data storage services.

Warning: Not yet used

Data store to manage local data, both job input and job output.

This includes URL prefetch, results storage and purging.

Design Notes

Desired features:

scalability - read and write a block at a time
efficiency - check if file already fetched
robustness - checksum to make sure fetched file is correct
security - allow access to secure urls
usability - avoid reentering passwords on accessible resources

What happens when file at the url changes? What happens when urllib drops the connection before the transfer is complete?
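The scalability and robustness goals above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the module's implementation: the name copy_with_checksum, the choice of SHA-1, the 64 KiB block size and the optional expected argument are all hypothetical. Writing to a temporary file and renaming it only on success is one possible answer to the dropped-connection question, since an interrupted transfer that raises never leaves a partial file at the destination.

```python
import hashlib
import os
import tempfile


def copy_with_checksum(source, dest_path, block_size=64 * 1024, expected=None):
    """Copy a readable binary stream to dest_path one block at a time.

    Returns the SHA-1 hex digest of the copied bytes. If `expected` is
    given and does not match, ValueError is raised and nothing is left
    at dest_path. (Hypothetical helper, not part of the module.)
    """
    digest = hashlib.sha1()
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # Write to a temporary file in the same directory so the final
    # rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir)
    try:
        with os.fdopen(fd, "wb") as out:
            while True:
                block = source.read(block_size)
                if not block:
                    break
                digest.update(block)
                out.write(block)
        if expected is not None and digest.hexdigest() != expected:
            raise ValueError("checksum mismatch for %s" % dest_path)
        # Only a complete, verified file becomes visible under dest_path.
        os.replace(tmp_path, dest_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return digest.hexdigest()
```

Any object with a read(size) method can serve as the source, for example the response object returned by urllib.request.urlopen.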

File caching services

We want something lightweight and robust. The filesystem already exists and acts as a database; we just need a string key tied to a file name.

Note that Unix filesystems handle very large directories poorly, which could kill performance unless a multilevel structure is created. Currently we ignore the problem and put everything in one big directory. Breaking the hash into, e.g., 4+4+12 characters would give a 3-level directory structure.
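A 4+4+12 split of the hash might look like the sketch below. The function name cache_path, the use of SHA-1 over the key, and the decision to use only the first 20 hex characters (with the last component serving as the file name) are assumptions made for illustration; the notes do not say which hash is used or how the remainder of the digest is handled.

```python
import hashlib
import os


def cache_path(root, key, widths=(4, 4, 12)):
    """Map a string key to a multilevel path under root.

    The key is hashed and the leading hex characters of the digest are
    cut into pieces of the given widths, one piece per path level.
    (Hypothetical helper, not part of the module.)
    """
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    parts = []
    start = 0
    for width in widths:
        parts.append(digest[start:start + width])
        start += width
    return os.path.join(root, *parts)
```

Because the path is derived only from the key, the caller never needs to know the layout, which is consistent with the point that the caller does not care how the cache is organized.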

Ultimately the caller doesn't care, but if we change it, then we may need to provide a migration path for the server.

The SNS datasets can involve many large files which are not generally reused from job to job. It would be nice to have cache aging based on last access time. This can probably be done after the fact by walking the cache directory. It may not work on all operating systems if access time is not properly supported.
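Aging by last access time could be done with a directory walk along these lines. This is a sketch only: purge_stale and its parameters are hypothetical names, and it relies on st_atime being maintained, which (as noted above) is not guaranteed on every operating system or mount configuration.

```python
import os
import time


def purge_stale(cache_dir, max_age_seconds, now=None):
    """Remove cached files not accessed within max_age_seconds.

    Returns the list of removed paths. (Hypothetical helper, not part
    of the module.)
    """
    now = time.time() if now is None else now
    removed = []
    for dirpath, _dirnames, filenames in os.walk(cache_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # st_atime is the last access time reported by the filesystem.
            if now - os.stat(path).st_atime > max_age_seconds:
                os.remove(path)
                removed.append(path)
    return removed
```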

Administrator needs to be able to get a snapshot of the cache sorted by size of cached object, last access time, etc.
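An administrator's snapshot could be produced by the same kind of walk. The function name cache_snapshot, the dictionary layout, and the sort conventions (largest first for size, least recently accessed first for access time) are assumptions chosen for this sketch.

```python
import os


def cache_snapshot(cache_dir, sort_by="size"):
    """List cached files with size and last access time.

    sort_by is "size" (largest first) or "atime" (least recently
    accessed first). (Hypothetical helper, not part of the module.)
    """
    entries = []
    for dirpath, _dirnames, filenames in os.walk(cache_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            entries.append(
                {"path": path, "size": info.st_size, "atime": info.st_atime}
            )
    entries.sort(key=lambda entry: entry[sort_by], reverse=(sort_by == "size"))
    return entries
```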

For performance, keys accessed in this session are stored in a dictionary. Consider making this a weak reference so that long-running servers do not show memory leaks. Better yet, kill the server periodically, since it should be robust enough to recover: many problems of long-running services vanish, and the recovery code gets exercised regularly.
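The weak-reference idea could be tried with weakref.WeakValueDictionary from the standard library. The sketch below is an assumption about how the session dictionary might be structured: CacheEntry, session_entries and remember are hypothetical names. Plain strings cannot be weakly referenced, so the values are wrapped in a small class.

```python
import weakref


class CacheEntry:
    """Record for one key seen in this session (hypothetical class)."""

    def __init__(self, key, path):
        self.key = key
        self.path = path


# Entries disappear from this mapping once nothing else refers to them,
# so a long-running server does not accumulate every key it has seen.
session_entries = weakref.WeakValueDictionary()


def remember(key, path):
    """Record a key for this session and return its entry."""
    entry = CacheEntry(key, path)
    session_entries[key] = entry
    return entry
```

The caller must hold on to the returned entry for as long as the key should stay in the session dictionary.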

Staging

The use cases we need to cover:

1. cluster with no local storage on nodes
2. cluster with local storage on nodes but no internet connection
3. cluster with local storage on nodes and an internet connection
4. independent nodes on the same subnet as server

All these use cases are satisfied by staging the prefetch URL on a local server where the nodes can access it when necessary. The work mapper itself should strive to keep computations on nodes which already have the problem loaded, to reduce network bandwidth and file load times, but this is a separate issue.

There are two additional use cases we will not cover:

0. files too large for a single node
5. completely independent nodes

Case 0 requires that the individual work units be calculated in parallel. This will certainly require staging of the files on a local server.

Case 5 is a situation such as BOINC, where a number of independent computers from across the internet combine to work on a problem. In these situations we will not want to demand large amounts of network bandwidth and local storage, so they are only suitable for small files. This means there will be minimal penalty for staging the file on a local server.

Signalling

File transfer must occur asynchronously. The compute nodes should not block as files are being staged. This leads to the question of how the asynchronous behaviour should be implemented. We look at several possibilities:

1. Local data server on separate node of the cluster
2. Local data server on main server running as separate process
3. Local data server on main server running as separate thread
4. Local data server on main server with async I/O

I'm not sure 4 is possible in Python. 3 should be the default, to reduce the number of processes and the difficulty of managing them on the server. However, because we need to support 1 on some architectures, we will get 2 for free. In any case, 1-4 can be supported by adding a server command saying that a precondition is met, so that the server does not need a busy loop to check a status flag in addition to the usual listening for client connections.
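Option 3 with a "precondition met" notification might be sketched as a worker thread that calls back when a fetch completes. This is not the module's FetchService: StagingThread, fetch and on_ready are hypothetical names, and in a real server the callback would presumably issue the server command described above rather than call a Python function directly.

```python
import queue
import threading


class StagingThread(threading.Thread):
    """Fetch requested URLs in the background (hypothetical class).

    fetch(url) does the transfer; on_ready(url, result, error) is
    called from this thread when a transfer finishes or fails.
    """

    def __init__(self, fetch, on_ready):
        super().__init__(daemon=True)
        self._fetch = fetch
        self._on_ready = on_ready
        self._requests = queue.Queue()

    def request(self, url):
        """Queue a URL for staging without blocking the caller."""
        self._requests.put(url)

    def stop(self):
        """Ask the thread to exit after the requests already queued."""
        self._requests.put(None)

    def run(self):
        while True:
            # Blocks until work arrives, so no busy loop is needed.
            url = self._requests.get()
            if url is None:
                break
            try:
                result, error = self._fetch(url), None
            except Exception as exc:
                result, error = None, exc
            self._on_ready(url, result, error)
```

The main server loop only queues requests and reacts to notifications, so it never has to poll a status flag.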

Credentials

When fetching data from services such as the SNS portal the user should not have to repeatedly enter a password. Credentials should be cached. The mechanism for this is unclear.

Classes
  FileCache
Create a data cache in a particular directory on the filesystem.
  FetchRequest
  FetchService
Functions
 
config(key, missing)
Variables
  URL_CACHE_PATH = '/tmp/urlcache'
Standard location for the local repository
  urlcache = FileCache(URL_CACHE_PATH)
The local cache for URLs