Prajna


DSet<'U>

Generic distributed dataset. If storing the content of a DSet to a persisted store, it is highly advisable that 'U is either a .NET system type, or that a customized serializer/deserializer is used to serialize 'U[]. Otherwise, the stored data may not be retrievable, as the BinaryFormatter may fail to deserialize a class whose associated code has changed.

Constructors

Constructor | Description
new()
Signature: unit -> DSet<'U>

Instance members

Instance member | Description
BinSort(partFunc, comparer)
Signature: (partFunc:Func<'U,int> * comparer:IComparer<'U>) -> DSet<'U>

Bin sort the DSet. Apply a partition function and repartition the elements of the dataset across the nodes in the cluster; the number of partitions remains unchanged. Elements within each partition/bin are sorted using 'comparer'.
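The bin-sort semantics can be sketched locally in plain Python (an illustrative stand-in for the distributed operation; `bin_sort`, `part_func`, and `key` are hypothetical names, not the Prajna API):

```python
def bin_sort(data, num_partitions, part_func, key=None):
    # Route each element to a bin chosen by the partition function...
    bins = [[] for _ in range(num_partitions)]
    for x in data:
        bins[part_func(x) % num_partitions].append(x)
    # ...then sort within each bin (the role of the 'comparer').
    return [sorted(b, key=key) for b in bins]

# Evens land in bin 0, odds in bin 1; each bin is sorted ascending.
print(bin_sort([5, 2, 9, 4, 1, 8], 2, lambda x: x % 2))  # [[2, 4, 8], [1, 5, 9]]
```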

BinSortN(...)
Signature: (numPartitions:int * partFunc:Func<'U,int> * comparer:IComparer<'U>) -> DSet<'U>

Bin sort the DSet. Apply a partition function and repartition the elements of the dataset into 'numPartitions' partitions across the nodes in the cluster. Elements within each partition/bin are sorted using 'comparer'.

BinSortP(param, partFunc, comparer)
Signature: (param:DParam * partFunc:Func<'U,int> * comparer:IComparer<'U>) -> DSet<'U>

Bin sort the DSet. Apply a partition function and repartition the elements of the dataset across the nodes in the cluster according to the settings specified by 'param'. Elements within each partition/bin are sorted using 'comparer'.

Bypass()
Signature: unit -> DSet<'U> * DSet<'U>

Bypass the DSet in 2 ways. One of the DSets is pulled; the other DSet will be in push dataflow.

Bypass2()
Signature: unit -> DSet<'U> * DSet<'U>

Bypass the DSet in 2 ways. One of the DSets is pulled; the other DSet will be in push dataflow.

Bypass3()
Signature: unit -> DSet<'U> * DSet<'U> * DSet<'U>

Bypass the DSet in 3 ways. One of the DSets is pulled; the other two DSets will be in push dataflow.

Bypass4()
Signature: unit -> DSet<'U> * DSet<'U> * DSet<'U> * DSet<'U>

Bypass the DSet in 4 ways. One of the DSets is pulled; the other three DSets will be in push dataflow.

BypassN(nWays)
Signature: nWays:int -> DSet<'U> * DSet<'U> []

Bypass the DSet in n ways. One of the DSets is pulled; the other n - 1 DSets will be in push dataflow.

CacheInMemory()
Signature: unit -> DSet<'U>

Cache DSet content in memory, in raw object form.

Choose(func)
Signature: func:Func<'U,Nullable<'U1>> -> DSet<'U1>

Applies the given function to each element of the dataset. The new dataset comprises the results for each element where the function returns a non-null value.
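The Choose semantics amount to a map followed by a null filter; a minimal local sketch in Python (`choose` is a hypothetical helper, with `None` standing in for a null/Nullable without a value):

```python
def choose(data, func):
    # Apply func to each element and keep only the non-None results.
    return [y for y in (func(x) for x in data) if y is not None]

# Keep the squares of even numbers only.
print(choose([1, 2, 3, 4], lambda x: x * x if x % 2 == 0 else None))  # [4, 16]
```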

Choose(func)
Signature: func:Func<'U,'U1> -> DSet<'U1>

Applies the given function to each element of the dataset. The new dataset comprises the results for each element where the function returns a non-null value.

Cluster()
Signature: unit -> Cluster

Get and Set Cluster

Cluster()
Signature: unit -> unit

Get and Set Cluster

Collect(func)
Signature: func:Func<'U,IEnumerable<'U1>> -> DSet<'U1>

Applies the given function to each element of the dataset and concatenates all the results.
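Collect behaves like a flat map: each element maps to a sequence, and the sequences are concatenated. A local Python sketch (hypothetical `collect` helper, not the Prajna API):

```python
def collect(data, func):
    # Apply func to each element and concatenate all resulting sequences.
    return [y for x in data for y in func(x)]

# Each number n expands to n copies of itself.
print(collect([1, 2, 3], lambda x: [x] * x))  # [1, 2, 2, 3, 3, 3]
```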

ContentKey()
Signature: unit -> uint64

Set a content key for the DSet that governs partition mapping. For two DSets that have the same content key, a given key will be mapped uniquely to a partition.

ContentKey()
Signature: unit -> unit

Set a content key for the DSet that governs partition mapping. For two DSets that have the same content key, a given key will be mapped uniquely to a partition.

Count()
Signature: unit -> int64

Count the number of elements (rows) in the DSet.

CrossJoin(mapFunc, x1)
Signature: (mapFunc:Func<'U,'U1,'U2> * x1:DSet<'U1>) -> DSet<'U2>
Type parameters: 'U2

Cross join: for each pair of entries in x:DSet<'U> and x1:DSet<'U1>, apply mapFunc:'U->'U1->'U2. The resultant set is the product of DSet<'U> and DSet<'U1>.
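A local Python sketch of the cross-join semantics (hypothetical `cross_join` helper): every element of the first dataset is paired with every element of the second, and the mapping function is applied to each pair.

```python
def cross_join(xs, ys, map_func):
    # The result has len(xs) * len(ys) entries: the full product.
    return [map_func(u, u1) for u in xs for u1 in ys]

print(cross_join([1, 2], ["a", "b"], lambda u, u1: (u, u1)))
# [(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')]
```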

CrossJoinChoose(mapFunc, x1)
Signature: (mapFunc:Func<'U,'U1,Nullable<'U2>> * x1:DSet<'U1>) -> DSet<'U2>
Type parameters: 'U2

Cross join with filtering: for each pair of entries in x:DSet<'U> and x1:DSet<'U1>, apply mapFunc. All results that are null are filtered out.

CrossJoinChoose(mapFunc, x1)
Signature: (mapFunc:Func<'U,'U1,'U2> * x1:DSet<'U1>) -> DSet<'U2>
Type parameters: 'U2

Cross join with filtering: for each pair of entries in x:DSet<'U> and x1:DSet<'U1>, apply mapFunc. All results that are null are filtered out.

CrossJoinFold(...)
Signature: (mapFunc:Func<'U,'U1,'U2> * foldFunc:Func<'S,'U2,'S> * initialState:'S * x1:DSet<'U1>) -> DSet<'S>
Type parameters: 'U2, 'S

Cross join with fold: for each pair of entries in x:DSet<'U> and x1:DSet<'U1>, apply mapFunc:'U->'U1->'U2, then fold the results with foldFunc:'S->'U2->'S.

Distribute(sourceEnumerable)
Signature: sourceEnumerable:IEnumerable<'U> -> DSet<'U>

Span an IEnumerable (dataset) to a distributed dataset by splitting the sequence into N partitions, with N being the number of active nodes in the cluster. Each node gets one partition of the dataset, so the number of partitions of the distributed dataset is N.

DistributeN(num, sourceEnumerable)
Signature: (num:int * sourceEnumerable:IEnumerable<'U>) -> DSet<'U>

Span an IEnumerable (dataset) to a distributed dataset by splitting the sequence into NUM * N partitions, with NUM being the number of partitions on each node and N being the number of nodes in the cluster. Each node gets NUM partitions of the dataset.

Execute(func)
Signature: func:Action -> unit

Execute a distributed customized functional delegate on each of the machines. The functional delegate runs remotely on each node. If any node in the cluster is not responding, the functional delegate does not run on that node.

ExecuteN(num, funcN)
Signature: (num:int * funcN:Action<int>) -> unit

Execute num distributed customized functional delegates on each of the machines. The functional delegates run remotely on each node. If any node in the cluster is not responding, the functional delegates do not run on that node.

Filter(func)
Signature: func:Func<'U,bool> -> DSet<'U>

Creates a new dataset containing only the elements of the dataset for which the given predicate returns true.

Fold(folder, aggrFunc, state)
Signature: (folder:Func<'State,'U,'State> * aggrFunc:Func<'State,'State,'State> * state:'State) -> 'State

Fold the entire DSet with a fold function, an aggregation function, and an initial state. The initial state is deserialized (separately) for each partition. Within each partition, the elements are folded into the state variable using the 'folder' function. Then 'aggrFunc' is used to aggregate the resulting state variables from all partitions into a single state.
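The two-phase fold can be sketched locally in Python (hypothetical `dset_fold` helper; the list-of-lists argument stands in for the partitioned DSet):

```python
from functools import reduce

def dset_fold(partitions, folder, aggr_func, state):
    # Each partition folds from its own copy of the initial state...
    partials = [reduce(folder, part, state) for part in partitions]
    # ...then the per-partition states are aggregated into one.
    return reduce(aggr_func, partials)

# Sum across two partitions: (1+2+3)=6 and (4+5)=9 aggregate to 15.
print(dset_fold([[1, 2, 3], [4, 5]], lambda s, x: s + x, lambda a, b: a + b, 0))  # 15
```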

Identity()
Signature: unit -> DSet<'U>

Identity mapping: the new DSet is the same as the old DSet, with an encode function installed.

Import(serversInfo, importName)
Signature: (serversInfo:ContractServersInfo * importName:string) -> DSet<'U>

Generate a distributed dataset by importing a customized functional delegate from a local contract store. The imported functional delegate is usually exported by another service (either in the same remote container or in another Prajna remote container). Each node runs one local instance of the functional delegate, which forms one partition of the dataset. The number of partitions of the dataset is N, where N is the number of nodes in the cluster. If any node in the cluster is not responding, the dataset does not contain the data resulting from the functional delegate on that node.

ImportN(serversInfo, importNames)
Signature: (serversInfo:ContractServersInfo * importNames:string []) -> DSet<'U>

Generate a distributed dataset by importing customized functional delegates from a local contract store. The imported functional delegates are usually exported by another service (either in the same remote container or in another Prajna remote container). Each node runs multiple local instances of the functional delegates, each of which forms one partition of the dataset. The number of partitions of the dataset is N * num, where N is the number of nodes in the cluster and num is the number of contracts in the contract list. If any node in the cluster is not responding, the dataset does not contain the data resulting from the functional delegates on that node.

Init(initFunc, partitionSizeFunc)
Signature: (initFunc:Func<(int * int),'U> * partitionSizeFunc:Func<int,int>) -> DSet<'U>

Create a distributed dataset on the distributed cluster, with each element created by a functional delegate.

InitN(initFunc, partitionSizeFunc)
Signature: (initFunc:Func<(int * int),'U> * partitionSizeFunc:Func<int,int,int>) -> DSet<'U>

Create a distributed dataset on the distributed cluster, with each element created by a functional delegate, using a given number of parallel executions per node. In the functional delegate that creates each element of the dataset, the first integer passed to the function indicates the partition, and the second integer indexes (from 0) the element within the partition. In the functional delegate that returns the size of a partition, the integer passed to the function indicates the partition.

InitS(initFunc, partitionSize)
Signature: (initFunc:Func<(int * int),'U> * partitionSize:int) -> DSet<'U>

Create a distributed dataset on the distributed cluster, with each element created by a functional delegate.

IsPartitionByKey()
Signature: unit -> bool

Is the partition of the DSet formed by a key function that maps a data item to a unique partition?

IsPartitionByKey()
Signature: unit -> unit

Is the partition of the DSet formed by a key function that maps a data item to a unique partition?

Iter(iterFunc)
Signature: iterFunc:Action<'U> -> unit

Applies the function 'iterFunc' to each element of the dataset.

LazySaveToHDD()
Signature: unit -> unit

Lazily save the DSet to HDD. This is a lazy evaluation. This DSet must be a branch generated by Bypass or a child of such branch. The save action will be triggered when a pull happens on other branches generated by Bypass.

LazySaveToHDDByName(name)
Signature: name:string -> unit

Lazily save the DSet to HDD using "name". This is a lazy evaluation. This DSet must be a branch generated by Bypass or a child of such branch. The save action will be triggered when a pull happens on other branches generated by Bypass.

LoadSource()
Signature: unit -> DSet<'U>

Trigger to load metadata.

LocalIter(iterFunc)
Signature: iterFunc:Action<'U> -> unit

Read the DSet back to the local machine and apply a function to each value. Caution: the iter function is executed locally, as the entire DSet is read back.

Map(func)
Signature: func:Func<'U,'U1> -> DSet<'U1>

Creates a new dataset whose elements are the results of applying the given function to each element of the dataset. The given function is applied per collection as the dataset is distributedly iterated; the entire dataset is never materialized at once.

Map2(func, x1)
Signature: (func:Func<'U,'U1,'V> * x1:DSet<'U1>) -> DSet<'V>
Type parameters: 'V

Create a new DSet whose elements are the results of applying the given function to the corresponding elements of the two DSets pairwise. The two DSets must have the same partition mapping structure and the same number of elements (e.g., established via Split).

Map3(func, x1, x2)
Signature: (func:Func<'U,'U1,'U2,'V> * x1:DSet<'U1> * x2:DSet<'U2>) -> DSet<'V>
Type parameters: 'U2, 'V

Create a new DSet whose elements are the results of applying the given function to the corresponding elements of the three DSets pairwise.

Map4(func, x1, x2, x3)
Signature: (func:Func<'U,'U1,'U2,'U3,'V> * x1:DSet<'U1> * x2:DSet<'U2> * x3:DSet<'U3>) -> DSet<'V>
Type parameters: 'U2, 'U3, 'V

Create a new DSet whose elements are the results of applying the given function to the corresponding elements of the four DSets pairwise.

MapByCollection(func)
Signature: (func:Func<'U [],'U1 []>) -> DSet<'U1>

Creates a new dataset whose elements are the results of applying the given function to each collection of elements of the dataset. In the input DSet, a partition can have multiple collections, the size of which is determined by the SerializationLimit of the cluster.

Mapi(func)
Signature: func:Func<int,int64,'U,'U1> -> DSet<'U1>

Creates a new dataset whose elements are the results of applying the given function to each element of the dataset. The first integer passed to the function indicates the partition, and the second integer indexes (from 0) the element within the partition.

MapReduce(mapFunc, reduceFunc)
Signature: (mapFunc:Func<'U,IEnumerable<'K1 * 'V1>> * reduceFunc:Func<('K1 * List<'V1>),'U2>) -> DSet<'U2>
Type parameters: 'V1, 'U2

MapReduce, see http://en.wikipedia.org/wiki/MapReduce
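The map/reduce contract implied by the signature (mapFunc emits key-value pairs; reduceFunc receives a key with all of its values) can be illustrated by a local word count in Python (hypothetical `map_reduce` helper, not the Prajna API):

```python
from collections import defaultdict

def map_reduce(data, map_func, reduce_func):
    # Map phase: each element emits a sequence of (key, value) pairs.
    groups = defaultdict(list)
    for x in data:
        for k, v in map_func(x):
            groups[k].append(v)
    # Reduce phase: one call per key, with the list of all its values.
    return [reduce_func(k, vs) for k, vs in sorted(groups.items())]

lines = ["a b a", "b b"]
print(map_reduce(lines,
                 lambda line: [(w, 1) for w in line.split()],
                 lambda k, vs: (k, sum(vs))))  # [('a', 2), ('b', 3)]
```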

MapReduceP(param, mapFunc, reduceFunc)
Signature: (param:DParam * mapFunc:Func<'U,IEnumerable<'K1 * 'V1>> * reduceFunc:Func<('K1 * List<'V1>),'U2>) -> DSet<'U2>
Type parameters: 'V1, 'U2

MapReduce, see http://en.wikipedia.org/wiki/MapReduce

MapReducePWithPartitionFunction(...)
Signature: (param:DParam * mapFunc:Func<'U,IEnumerable<'K1 * 'V1>> * partFunc:Func<'K1,int> * reduceFunc:Func<('K1 * List<'V1>),'U2>) -> DSet<'U2>
Type parameters: 'V1, 'U2

MapReduce, see http://en.wikipedia.org/wiki/MapReduce

Mix(x1)
Signature: x1:DSet<'U1> -> DSet<'U * 'U1>

Mix two DSets that have the same size and partition layout into a single DSet by combining the individual data items (rows).

Mix2(x1)
Signature: x1:DSet<'U1> -> DSet<'U * 'U1>

Mix two DSets that have the same size and partition layout into a single DSet by combining the individual data items (rows).

Mix3(x1, x2)
Signature: (x1:DSet<'U1> * x2:DSet<'U2>) -> DSet<'U * 'U1 * 'U2>
Type parameters: 'U2

Mix three DSets that have the same size and partition layout into a single DSet by combining the individual data items (rows).

Mix4(x1, x2, x3)
Signature: (x1:DSet<'U1> * x2:DSet<'U2> * x3:DSet<'U3>) -> DSet<'U * 'U1 * 'U2 * 'U3>
Type parameters: 'U2, 'U3

Mix four DSets that have the same size and partition layout into a single DSet by combining the individual data items (rows).

Multicast()
Signature: unit -> DSet<'U>

Multicast a DSet over the network, replicating its content to all peers in the cluster.

Name()
Signature: unit -> string

Name of the DSet

Name()
Signature: unit -> unit

Name of the DSet

NumParallelExecution()
Signature: unit -> int

Maximum number of parallel threads that will execute the data analytic jobs in a remote container. If 0, the remote container determines the number of parallel threads according to its available computation and memory resources.

NumParallelExecution()
Signature: unit -> unit

Maximum number of parallel threads that will execute the data analytic jobs in a remote container. If 0, the remote container determines the number of parallel threads according to its available computation and memory resources.

NumPartitions()
Signature: unit -> int

Number of partitions

NumPartitions()
Signature: unit -> unit

Number of partitions

NumReplications()
Signature: unit -> int

Required number of replications for durability

NumReplications()
Signature: unit -> unit

Required number of replications for durability

ParallelMap(func)
Signature: func:Func<'U,Task<'U1>> -> DSet<'U1>

Map the DSet, where func is a Task<_>-returning function that may contain asynchronous operations. You will need to start the first task in the mapping function; Prajna will not be able to start the task for you, as the returned task may not be a Task in the Created state. See: http://blogs.msdn.com/b/pfxteam/archive/2012/01/14/10256832.aspx

ParallelMapi(func)
Signature: func:Func<int,int64,'U,Task<'U1>> -> DSet<'U1>

Map the DSet, where func is a Task<_>-returning function that may contain asynchronous operations. The first integer passed to the function indicates the partition, and the second integer indexes (from 0) the element within the partition. You will need to start the first task in the mapping function; Prajna will not be able to start the task for you, as the returned task may not be a Task in the Created state. See: http://blogs.msdn.com/b/pfxteam/archive/2012/01/14/10256832.aspx

Password()
Signature: unit -> string

Password that will be hashed and used for triple DES encryption and decryption of data.

Password()
Signature: unit -> unit

Password that will be hashed and used for triple DES encryption and decryption of data.

PeerRcvdSpeedLimit()
Signature: unit -> int64

Flow control: limits the total bytes sent out to PeerRcvdSpeedLimit. When communicating with N peers, the sending queue limit for each peer is PeerRcvdSpeedLimit/N.

PeerRcvdSpeedLimit()
Signature: unit -> unit

Flow control: limits the total bytes sent out to PeerRcvdSpeedLimit. When communicating with N peers, the sending queue limit for each peer is PeerRcvdSpeedLimit/N.

PostGroupByReserialization()
Signature: unit -> int

In BinSort/MapReduce, indicates the collection size after a collection of data is received from the network.

PostGroupByReserialization()
Signature: unit -> unit

In BinSort/MapReduce, indicates the collection size after a collection of data is received from the network.

PreGroupByReserialization()
Signature: unit -> int

In BinSort/MapReduce, indicates whether a collection needs to be regrouped before it is sent across the network.

PreGroupByReserialization()
Signature: unit -> unit

In BinSort/MapReduce, indicates whether a collection needs to be regrouped before it is sent across the network.

Reduce(reducer)
Signature: reducer:Func<'U,'U,'U> -> 'U

Reduces the elements using the specified 'reducer' function

ReorgWDegree(numParallelJobs)
Signature: numParallelJobs:int -> DSet<'U>

Reorganize the collections in a dataset to accommodate a given degree of parallel execution.

Repartition(partFunc)
Signature: partFunc:Func<'U,int> -> DSet<'U>

Apply a partition function and repartition the elements of the dataset across the nodes in the cluster. The number of partitions remains unchanged.
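A local Python sketch of the repartition semantics (hypothetical `repartition` helper; the list of lists stands in for the partitioned DSet): the partition count stays fixed, and each element moves to the partition the function selects.

```python
def repartition(partitions, part_func):
    n = len(partitions)  # the number of partitions is unchanged
    out = [[] for _ in range(n)]
    for part in partitions:
        for x in part:
            out[part_func(x) % n].append(x)
    return out

# Evens gather in partition 0, odds in partition 1.
print(repartition([[1, 2, 3], [4, 5, 6]], lambda x: x % 2))  # [[2, 4, 6], [1, 3, 5]]
```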

RepartitionN(numPartitions, partFunc)
Signature: (numPartitions:int * partFunc:Func<'U,int>) -> DSet<'U>

Apply a partition function and repartition the elements of the dataset into 'numPartitions' partitions across the nodes in the cluster.

RepartitionP(param, partFunc)
Signature: (param:DParam * partFunc:Func<'U,int>) -> DSet<'U>

Apply a partition function and repartition the elements of the dataset across the nodes in the cluster according to the settings specified by 'param'.

RowsMergeAll()
Signature: unit -> DSet<'U>

Merge all rows of a partition into a single collection object.

RowsReorg(numRows)
Signature: numRows:int -> DSet<'U>

Reorganize the collections of a dataset. If numRows = 1, the dataset is split into one row per collection. If numRows = Int32.MaxValue, the dataset is merged so that all rows in a partition are grouped into a single collection.
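The regrouping can be sketched locally in Python (hypothetical `rows_reorg` helper operating on a single partition's rows):

```python
def rows_reorg(partition, num_rows):
    # Regroup rows into collections of at most num_rows rows each.
    return [partition[i:i + num_rows] for i in range(0, len(partition), num_rows)]

print(rows_reorg([1, 2, 3, 4, 5], 2))  # [[1, 2], [3, 4], [5]]
print(rows_reorg([1, 2, 3, 4, 5], 1))  # one row per collection
```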

RowsSplit()
Signature: unit -> DSet<'U>

Reorganize the collections of a dataset so that each collection has a single row.

SaveToHDD()
Signature: unit -> unit

Save the DSet to HDD. This is an action.

SaveToHDDByName(name)
Signature: name:string -> unit

Save the DSet to HDD using "name". This is an action.

SaveToHDDByNameWithMonitor(...)
Signature: (monitorFunc:Func<'U,bool> * localIterFunc:Action<'U> * name:string) -> unit

Save the DSet to HDD using "name". This is an action. Attach a monitor function that selects elements to be fetched to the local machine and iterated by 'localIterFunc'.

SaveToHDDWithMonitor(...)
Signature: (monitorFunc:Func<'U,bool> * localIterFunc:Action<'U>) -> unit

Save the DSet to HDD. This is an action. Attach a monitor function that selects elements to be fetched to the local machine and iterated by 'localIterFunc'.

SendingQueueLimit()
Signature: unit -> int

Sender flow control: DSet/DStream limits the total sending queue to SendingQueueLimit. When communicating with N peers, the sending queue limit for each peer is SendingQueueLimit/N.

SendingQueueLimit()
Signature: unit -> unit

Sender flow control: DSet/DStream limits the total sending queue to SendingQueueLimit. When communicating with N peers, the sending queue limit for each peer is SendingQueueLimit/N.

SerializationLimit()
Signature: unit -> int

Number of records in a collection during data analytic jobs. This parameter does not change the number of records in an existing collection of a DSet. To change the number of records of an existing collection, please use RowsReorg().

SerializationLimit()
Signature: unit -> unit

Number of records in a collection during data analytic jobs. This parameter does not change the number of records in an existing collection of a DSet. To change the number of records of an existing collection, please use RowsReorg().

SizeInBytes
Signature: int64

Get the size of all key-values or blobs in DSet

Source(sourceSeqFunc)
Signature: sourceSeqFunc:Func<IEnumerable<'U>> -> DSet<'U>

Generate a distributed dataset through a customized seq functional delegate running on each of the machines. Each node runs a local instance of the functional delegate, which generates a seq<'U> that forms one partition of DSet<'U>. The number of partitions of DSet<'U> is N, the number of nodes in the cluster. The NumReplications is 1. If any node in the cluster is not responding, the dataset does not contain the data resulting from the functional delegate on that node.

SourceI(numPartitions, sourceISeqFunc)
Signature: (numPartitions:int * sourceISeqFunc:Func<int,IEnumerable<'U>>) -> DSet<'U>

Generate a distributed dataset through customized seq functional delegates running on each of the machines. The number of partitions of the dataset is numPartitions. If any node in the cluster is not responding, the partition may be rerun on a different node.

SourceN(num, sourceNSeqFunc)
Signature: (num:int * sourceNSeqFunc:Func<int,IEnumerable<'U>>) -> DSet<'U>

Generate a distributed dataset through customized seq functional delegates running on each of the machines. Each node runs num local instances of the functional delegates, each of which forms one partition of the dataset. The number of partitions of the dataset is N * num, where N is the number of nodes in the cluster and num is the number of partitions per node. The NumReplications is 1. If any node in the cluster is not responding, the dataset does not contain the data resulting from the functional delegates on that node.

Split2(fun0, fun1)
Signature: (fun0:Func<'U,'U0> * fun1:Func<'U,'U1>) -> DSet<'U0> * DSet<'U1>
Type parameters: 'U1

Correlated split of a dataset into two, each of which is created by running a functional delegate that maps the elements of the original dataset. The resultant datasets have the same partition and collection structure as the original dataset, and can be combined later by the Map2 transform.
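The split-then-recombine pattern can be sketched locally in Python (hypothetical `split2` and `map2` helpers): because both outputs preserve the input's structure, they can later be zipped back element-wise, which is what makes Map2 valid on them.

```python
def split2(data, f0, f1):
    # Correlated split: both outputs keep the structure of the input.
    return [f0(x) for x in data], [f1(x) for x in data]

def map2(xs, ys, func):
    # Element-wise recombination, valid because the structures match.
    return [func(a, b) for a, b in zip(xs, ys)]

d0, d1 = split2([1, 2, 3], lambda x: x * 10, lambda x: x + 1)
print(map2(d0, d1, lambda a, b: a + b))  # [12, 23, 34]
```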

Split3(fun0, fun1, fun2)
Signature: (fun0:Func<'U,'U0> * fun1:Func<'U,'U1> * fun2:Func<'U,'U2>) -> DSet<'U0> * DSet<'U1> * DSet<'U2>
Type parameters: 'U1, 'U2

Correlated split of a dataset into three, each of which is created by running a functional delegate that maps the elements of the original dataset. The resultant datasets have the same partition and collection structure as the original dataset, and can be combined later by the Map3 transform.

Split4(fun0, fun1, fun2, fun3)
Signature: (fun0:Func<'U,'U0> * fun1:Func<'U,'U1> * fun2:Func<'U,'U2> * fun3:Func<'U,'U3>) -> DSet<'U0> * DSet<'U1> * DSet<'U2> * DSet<'U3>
Type parameters: 'U1, 'U2, 'U3

Correlated split of a dataset into four, each of which is created by running a functional delegate that maps the elements of the original dataset. The resultant datasets have the same partition and collection structure as the original dataset, and can be combined later by the Map4 transform.

StorageType()
Signature: unit -> StorageKind

Storage type, which includes StorageMedia and IndexMethod.

StorageType()
Signature: unit -> unit

Storage type, which includes StorageMedia and IndexMethod.

Store(o)
Signature: o:IEnumerable<'U> -> unit

Store an IEnumerable to a persisted DSet

ToIEnumerable()
Signature: unit -> IEnumerable<'U>

Convert DSet to an IEnumerable<'U>

TypeOfLoadBalancer()
Signature: unit -> LoadBalanceAlgorithm

Get or set the load balancer. Note that the change will affect the Partitioner.

TypeOfLoadBalancer()
Signature: unit -> unit

Get or set the load balancer. Note that the change will affect the Partitioner.

Union(source)
Signature: source:IEnumerable<DSet<'U>> -> DSet<'U>

Merge the content of multiple datasets into a single dataset. The original datasets become partitions of the resultant dataset. This can be considered merging datasets by rows, where all datasets have the same column structure.
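A local Python sketch of the union semantics (hypothetical `union` helper; each dataset is modeled as a list of partitions): the inputs' partitions are concatenated, so each original dataset contributes its partitions to the result.

```python
def union(datasets):
    # Each input dataset contributes its partitions to the result.
    return [part for ds in datasets for part in ds]

ds1 = [[1, 2], [3]]  # dataset with two partitions
ds2 = [[4, 5]]       # dataset with one partition
print(union([ds1, ds2]))  # [[1, 2], [3], [4, 5]]
```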

Version()
Signature: unit -> DateTime

Get and Set version

Version()
Signature: unit -> unit

Get and Set version

VersionString
Signature: string

Represent version in a string for display

Static members

Static member | Description
TryFind(cluster, searchPattern)
Signature: (cluster:Cluster * searchPattern:string) -> DSet<'?14402> []