Stratified Random Sampling

The Algorithm

The algorithm is based on the paper "Scalable Simple Random Sampling and Stratified Sampling" (the ScaSRS algorithm) by Xiangrui Meng. It works effectively on large-scale data sets. When I tested it on a small data set, almost all of the data was accepted under the criterion from the paper.
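To make the acceptance criterion concrete, here is a minimal sketch of the two thresholds, assuming the bounds as they are usually stated for ScaSRS (please check them against the paper). The object name, the function name, and the default `delta` are illustrative and are not taken from the code in this repository. An item whose random key falls below `acceptBelow` is accepted at once, one above `rejectAbove` is rejected at once, and everything in between goes to the waiting list.

```scala
object ScaSRSThresholds {
  /** Returns (acceptBelow, rejectAbove) for drawing s items out of n.
    * delta is the tolerated failure rate; its default here is an assumption.
    */
  def thresholds(n: Long, s: Long, delta: Double = 1e-4): (Double, Double) = {
    val p  = s.toDouble / n                       // target sampling probability
    val g1 = -math.log(delta) / n                 // slack for the rejection bound
    val g2 = -2.0 * math.log(delta) / (3.0 * n)   // slack for the acceptance bound
    val rejectAbove = math.min(1.0, p + g1 + math.sqrt(g1 * g1 + 2.0 * g1 * p))
    val acceptBelow = math.max(0.0, p + g2 - math.sqrt(g2 * g2 + 3.0 * g2 * p))
    (acceptBelow, rejectAbove)
  }

  def main(args: Array[String]): Unit = {
    // Same 10% sampling rate on a small and on a large data set.
    for (n <- Seq(100L, 10000000L)) {
      val (lo, hi) = thresholds(n, n / 10)
      println(s"n=$n acceptBelow=$lo rejectAbove=$hi waitingListWidth=${hi - lo}")
    }
  }
}
```

The gap between the two thresholds shrinks as `n` grows, so on large data only a small fraction of the items needs the sort-based step, while on small data the thresholds sit far from the target rate `p`.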

Spark exercise

Stratified sampling is count-based sampling that allocates a different sample size to each stratum. Strata can be defined with a function that appends a stratum indicator to each record of the data RDD. Here I developed the "myAppendIndicator" function as an example. The basic steps for Stratified Random Sampling are:

Compute the collection of strata IDs
Filter out each stratum
Compute the stratum size and the corresponding sample size for each stratum
Run ScaSRS on each stratum (an RDD filtered out of the entire dataset)
- Assign a random value to each item in the stratum
- Compute the thresholds for acceptance and rejection on the fly
- Filter out the waiting-list RDD
- Apply the random-sort-based SRS method to the waiting list
- Construct the sampled RDD by taking the union of the accepted RDD and the RDD sampled from the waiting list
Construct the sampled RDD for the entire dataset by taking the union of the sampled RDDs of all strata
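The steps above can be sketched on plain Scala collections, which keeps the example runnable without a cluster. This is not the code from the repository: `StratifiedScaSRSSketch`, `scaSRS`, `stratifiedSample`, the `fraction` parameter, and the default `delta` are all hypothetical names and choices, and the threshold formulas are the commonly stated ScaSRS bounds, assumed rather than copied from the paper. `map`, `filter`, `sortBy`, and `take` also exist on Spark RDDs, so the structure should carry over, with `union` in place of `++`.

```scala
import scala.util.Random

object StratifiedScaSRSSketch {
  // Assumed ScaSRS bounds: keys below the first value are accepted,
  // keys above the second are rejected, the rest go to the waiting list.
  def thresholds(n: Long, s: Long, delta: Double = 1e-4): (Double, Double) = {
    val p  = s.toDouble / n
    val g1 = -math.log(delta) / n
    val g2 = -2.0 * math.log(delta) / (3.0 * n)
    (math.max(0.0, p + g2 - math.sqrt(g2 * g2 + 3.0 * g2 * p)),
     math.min(1.0, p + g1 + math.sqrt(g1 * g1 + 2.0 * g1 * p)))
  }

  // Draws exactly min(s, stratum.size) items from one stratum.
  def scaSRS[T](stratum: Seq[T], s: Int, rng: Random): Seq[T] = {
    require(s >= 0, "sample size must be non-negative")
    if (s >= stratum.size) stratum
    else {
      val (acceptBelow, rejectAbove) = thresholds(stratum.size, s)
      val keyed    = stratum.map(x => (rng.nextDouble(), x))       // random key per item
      val accepted = keyed.filter(_._1 < acceptBelow)              // accepted on the fly
      val waitlist = keyed.filter { case (k, _) => k >= acceptBelow && k <= rejectAbove }
      val need     = s - accepted.size
      if (need < 0 || waitlist.size < need)
        keyed.sortBy(_._1).take(s).map(_._2)                       // rare failure: full sort
      else
        (accepted ++ waitlist.sortBy(_._1).take(need)).map(_._2)   // accepted + top of waitlist
    }
  }

  // Samples the same fraction from every stratum and combines the results.
  def stratifiedSample[T, K](data: Seq[T], strataOf: T => K,
                             fraction: Double, seed: Long): Seq[T] = {
    val rng       = new Random(seed)
    val tagged    = data.map(x => (strataOf(x), x))       // append the stratum indicator
    val strataIds = tagged.map(_._1).distinct             // collection of strata IDs
    strataIds.flatMap { id =>
      val stratum    = tagged.filter(_._1 == id).map(_._2) // filter out one stratum
      val sampleSize = (stratum.size * fraction).toInt     // sample size for this stratum
      scaSRS(stratum, sampleSize, rng)                     // ScaSRS within the stratum
    }                                                      // flatMap acts as the union
  }
}
```

For example, `StratifiedScaSRSSketch.stratifiedSample((1 to 1000).toSeq, (x: Int) => x % 3, 0.1, 7L)` treats the remainder modulo 3 as the stratum ID and draws a tenth of each stratum. The fallback to a full sort covers the low-probability case in which the thresholds accept too many items or leave too few on the waiting list.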

Working code with Spark-shell

The file "databricks1.scala" is working code for the Spark shell. It demonstrates the usage of the "myScaSRS(RDD,sampleSize)" function, which is Serializable.

Standalone Scala with Spark API

The code was slightly modified to run as a standalone Scala app; "databricks2.scala" is added.

Trait for ScaSRS with Spark API

The code was further modified to use traits. "RandomStratifiedSampler" extends "RandomSampler" and mixes in the trait "StratifiedSampler", which provides methods for appending strata IDs and running ScaSRS on each stratum. It also overrides the "sample" function to sample stratum by stratum and combine the results with a union.
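The class layout described above might look roughly like the following sketch. Only the three type names come from this README; every signature here is an assumption. In particular, the `RandomSampler` below is a small stand-in defined for the example, not Spark's `org.apache.spark.util.random.RandomSampler`, and a plain sort-based SRS stands in for ScaSRS to keep the example short. It works on `Seq` rather than RDDs so that it runs without Spark.

```scala
import scala.util.Random

// Stand-in base class (hypothetical): owns the random generator.
abstract class RandomSampler[T](seed: Long) {
  protected val rng = new Random(seed)
  def sample(items: Seq[T], fraction: Double): Seq[T]
}

// Hypothetical trait holding the stratum-related methods.
trait StratifiedSampler[T, K] {
  def strataId(item: T): K
  // Tag every item with the ID of its stratum.
  def appendStrataIds(items: Seq[T]): Seq[(K, T)] =
    items.map(x => (strataId(x), x))
  // Draw `size` items from a single stratum.
  def sampleStratum(stratum: Seq[T], size: Int): Seq[T]
}

class RandomStratifiedSampler[T, K](seed: Long, idOf: T => K)
    extends RandomSampler[T](seed) with StratifiedSampler[T, K] {

  def strataId(item: T): K = idOf(item)

  // Sort-based SRS, used here in place of ScaSRS: keep the `size` smallest keys.
  def sampleStratum(stratum: Seq[T], size: Int): Seq[T] =
    stratum.map(x => (rng.nextDouble(), x)).sortBy(_._1).take(size).map(_._2)

  // Sample stratum by stratum and concatenate (the union step).
  override def sample(items: Seq[T], fraction: Double): Seq[T] = {
    val tagged = appendStrataIds(items)
    tagged.map(_._1).distinct.flatMap { id =>
      val stratum = tagged.filter(_._1 == id).map(_._2)
      sampleStratum(stratum, math.round(stratum.size * fraction).toInt)
    }
  }
}
```

A call such as `new RandomStratifiedSampler[Int, Int](42L, _ % 2).sample((1 to 100).toSeq, 0.2)` would then draw the same fraction from the even and the odd numbers. Keeping the stratum logic in a trait lets the per-stratum sampler be swapped without touching the code that splits and recombines the strata.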

