Translate Zeppelin Tutorial Document to Korean #10

Open
wants to merge 7 commits into base: master
Changes from 3 commits
2 changes: 2 additions & 0 deletions docs/README.md
@@ -61,3 +61,5 @@ If you wish to help us and contribute to Zeppelin Documentation, please look at
```
3. copy `zeppelin/docs/_site` to `asf-zeppelin/site/docs/[VERSION]`
4. ```svn commit```

Lee-Seonmi
63 changes: 31 additions & 32 deletions docs/quickstart/tutorial.md
@@ -19,29 +19,29 @@ limitations under the License.
-->
{% include JB/setup %}

# Zeppelin Tutorial
# 제플린 튜토리얼

<div id="toc"></div>

This tutorial walks you through some of the fundamental Zeppelin concepts. We will assume you have already installed Zeppelin. If not, please see [here](../install/install.html) first.
이 튜토리얼은 핵심 제플린 개념의 일부를 소개합니다. 튜토리얼에 들어가기 전에 제플린을 먼저 설치해야합니다.그렇지 않으면 [이 곳](../install/install.html)을 먼저 참조합니다.

Current main backend processing engine of Zeppelin is [Apache Spark](https://spark.apache.org). If you're new to this system, you might want to start by getting an idea of how it processes data to get the most out of Zeppelin.
현재 제플린의 주요 백엔드 처리 엔진은 [Apache Spark](https://spark.apache.org)입니다. 이 시스템이 처음이라면, 스파크가 제플린을 최대한 활용하기 위해 데이터를 어떻게 처리하는 지에 대한 방안을 가지고 시작하길 원할 것입니다.

## Tutorial with Local File
## 로컬 파일을 이용한 튜토리얼

### Data Refine
### 데이터 정제

Before you start Zeppelin tutorial, you will need to download [bank.zip](http://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip).
튜토리얼을 시작하기 전에, [bank.zip](http://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip)을 먼저 다운로드 받습니다.

First, to transform csv format data into RDD of `Bank` objects, run following script. This will also remove header using `filter` function.
우선, csv 형식 데이터를 Bank 객체의 RRD로 변환하기 위해 아래 스크립트를 실행한다. 또한 filter 함수를 사용해서 헤더를 제거합니다.
Review comment: Please keep the writing style consistent: 실행한다. → 실행합니다.


```scala

val bankText = sc.textFile("yourPath/bank/bank-full.csv")

case class Bank(age:Integer, job:String, marital : String, education : String, balance : Integer)

// split each line, filter out header (starts with "age"), and map it into Bank case class
// 각 라인을 분리하여 "age"로 시작하는 헤더를 걸러내고, 'Bank' case class로 매핑합니다.
val bank = bankText.map(s=>s.split(";")).filter(s=>s(0)!="\"age\"").map(
s=>Bank(s(0).toInt,
s(1).replaceAll("\"", ""),
@@ -51,38 +51,37 @@ val bank = bankText.map(s=>s.split(";")).filter(s=>s(0)!="\"age\"").map(
)
)

// convert to DataFrame and create temporal table
// DataFrame으로 변환하고 임시 테이블을 생성합니다.
bank.toDF().registerTempTable("bank")
```
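
As an optional aside (not part of the original tutorial): once `registerTempTable("bank")` has run, the table can also be queried from a Scala paragraph instead of a `%sql` one. A minimal sketch, assuming Zeppelin's Spark interpreter exposes the SQLContext as `sqlc` (the name used later in this tutorial):

```scala
// Query the temporary table registered above directly from Scala.
// `sqlc` is assumed to be the SQLContext provided by the Zeppelin Spark interpreter.
val under30 = sqlc.sql(
  "select age, count(1) as cnt from bank where age < 30 group by age order by age")

// Print a few result rows in the notebook output.
under30.take(10).foreach(println)
```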

### Data Retrieval
### 데이터 검색

Suppose we want to see age distribution from `bank`. To do this, run:
bank의 나이 분포를 확인하려면, 아래를 실행합니다.

```sql
%sql select age, count(1) from bank where age < 30 group by age order by age
```

You can make input box for setting age condition by replacing `30` with `${maxAge=30}`.
`30``${maxAge=30}`으로 대체해서 나이 조건을 설정하는 입력 상자를 만들 수 있습니다.

```sql
%sql select age, count(1) from bank where age < ${maxAge=30} group by age order by age
```

Now we want to see age distribution with certain marital status and add combo box to select marital status. Run:
혼인 여부를 포함한 나이 분포를 확인하고, 혼인 여부를 선택할 선택 박스를 추가하려면, 아래를 실행합니다.ㄴ
Review comment: There is a typo at the end of the last sentence.


```sql
%sql select age, count(1) from bank where marital="${marital=single,single|divorced|married}" group by age order by age
```

<br />
## Tutorial with Streaming Data
## 스트리밍 데이터를 이용한 튜토리얼

### Data Refine
### 데이터 정제

Since this tutorial is based on Twitter's sample tweet stream, you must configure authentication with a Twitter account. To do this, take a look at [Twitter Credential Setup](https://databricks-training.s3.amazonaws.com/realtime-processing-with-spark-streaming.html#twitter-credential-setup). After you get API keys, you should fill out credential related values(`apiKey`, `apiSecret`, `accessToken`, `accessTokenSecret`) with your API keys on following script.

This will create a RDD of `Tweet` objects and register these stream data as a table:
이 튜토리얼은 트위터의 샘플 트윗 스트림을 기반으로 하기때문에, 트위터 계정으로 인증이 되어야합니다. 인증하기 위해, [Twitter Credential Setup](https://databricks-training.s3.amazonaws.com/realtime-processing-with-spark-streaming.html#twitter-credential-setup)을 참조합니다. API 키를 받은 후, 아래 스크립트에 자격 증명 관련 값(`apiKey`, `apiSecret`, `accessToken`, `accessTokenSecret`)을 API 키로 채워야합니다.
아래 스크립트는 Tweet 객체의 RDD를 생성하고, 스트림 데이터를 테이블로 등록합니다.

```scala
import org.apache.spark.streaming._
@@ -95,7 +94,7 @@ import org.apache.log4j.Logger
import org.apache.log4j.Level
import sys.process.stringSeqToProcess

/** Configures the Oauth Credentials for accessing Twitter */
/** 트위터에 접근하기 위한 Oauth 자격 증명을 구성 */
def configureTwitterCredentials(apiKey: String, apiSecret: String, accessToken: String, accessTokenSecret: String) {
val configs = new HashMap[String, String] ++= Seq(
"apiKey" -> apiKey, "apiSecret" -> apiSecret, "accessToken" -> accessToken, "accessTokenSecret" -> accessTokenSecret)
@@ -111,7 +110,7 @@ def configureTwitterCredentials(apiKey: String, apiSecret: String, accessToken:
println()
}

// Configure Twitter credentials
// 트위터 자격증명 구성
val apiKey = "xxxxxxxxxxxxxxxxxxxxxxxxx"
val apiSecret = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
val accessToken = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
@@ -127,9 +126,9 @@ case class Tweet(createdAt:Long, text:String)
twt.map(status=>
Tweet(status.getCreatedAt().getTime()/1000, status.getText())
).foreachRDD(rdd=>
// Below line works only in spark 1.3.0.
// For spark 1.1.x and spark 1.2.x,
// use rdd.registerTempTable("tweets") instead.
// 아래 코드는 spark 1.3.0에서만 작동합니다.
// [Review comment] Spacing: 1.3.0에서만 작동합니다. → 1.3.0 에서만 작동합니다.

// spark 1.1.x and spark 1.2.x 에서는
// rdd.registerTempTable("tweets") 을 사용해야합니다.
rdd.toDF().registerAsTable("tweets")
)

@@ -138,24 +137,24 @@ twt.print
ssc.start()
```
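
A small optional sanity check (not in the original script): before querying the `tweets` table, you may want to confirm that tweets are actually arriving. One way is to register one more output operation on the windowed stream; note that it has to be added *before* `ssc.start()` in the block above. A sketch:

```scala
// Hypothetical addition — place this line before ssc.start() in the script above.
// It prints the number of tweets captured in each 60-second window,
// which is a quick way to verify that the Twitter credentials are working.
twt.count().print()
```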

### Data Retrieval
### 데이터 검색

For each following script, every time you click run button you will see different result since it is based on real-time data.
아래 각 스크립트는 실시간 데이터를 기반으로하기때문에 실행 버튼을 클릭할 때마다 다른 결과값을 출력합니다.
Review comment: Spelling: 기반으로하기때문에 → 기반으로 하므로.


Let's begin by extracting maximum 10 tweets which contain the word **girl**.
단어 **girl**을 포함하는 최대 10개의 트윗을 추출해봅시다.

```sql
%sql select * from tweets where text like '%girl%' limit 10
```

This time suppose we want to see how many tweets have been created per sec during last 60 sec. To do this, run:
지난 60초 동안 초당 얼마나 많은 트윗이 생성되었는지 확인해봅시다.

```sql
%sql select createdAt, count(1) from tweets group by createdAt order by createdAt
```


You can make user-defined function and use it in Spark SQL. Let's try it by making function named `sentiment`. This function will return one of the three attitudes( positive, negative, neutral ) towards the parameter.
또한, 사용자 정의 함수를 만들어서 스파크 SQL에서 사용할 수도 있습니다. `sentiment`라는 함수를 만들어서 연습해봅시다. 이 함수는 파라미터에 대하여 세 가지 속성(긍정, 부정, 중립) 중 하나를 반환합니다.

```scala
def sentiment(s:String) : String = {
@@ -184,14 +183,14 @@ def sentiment(s:String) : String = {
"neutral"
}

// Below line works only in spark 1.3.0.
// For spark 1.1.x and spark 1.2.x,
// use sqlc.registerFunction("sentiment", sentiment _) instead.
// 아래 코드는 spark 1.3.0에서만 작동합니다.
// spark 1.1.x and spark 1.2.x 에서는
// sqlc.registerFunction("sentiment", sentiment _) 을 사용해야합니다.
sqlc.udf.register("sentiment", sentiment _)

```

To check how people think about girls using `sentiment` function we've made above, run this:
위에서 만든 `sentiment` 함수를 사용하여 사람들이 'girl'에 대해 어떻게 생각하는지 확인하기위해 아래를 실행합니다.

```sql
%sql select sentiment(text), count(1) from tweets where text like '%girl%' group by sentiment(text)
198 changes: 198 additions & 0 deletions docs/quickstart/tutorial_en.md
@@ -0,0 +1,198 @@
---
layout: page
title: "Apache Zeppelin Tutorial"
description: "This tutorial page contains a short walk-through tutorial that uses Apache Spark backend. Please note that this tutorial is valid for Spark 1.3 and higher."
group: quickstart
---
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
{% include JB/setup %}

# Zeppelin Tutorial

<div id="toc"></div>

This tutorial walks you through some of the fundamental Zeppelin concepts. We will assume you have already installed Zeppelin. If not, please see [here](../install/install.html) first.

Current main backend processing engine of Zeppelin is [Apache Spark](https://spark.apache.org). If you're new to this system, you might want to start by getting an idea of how it processes data to get the most out of Zeppelin.

## Tutorial with Local File

### Data Refine

Before you start Zeppelin tutorial, you will need to download [bank.zip](http://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip).

First, to transform csv format data into RDD of `Bank` objects, run following script. This will also remove header using `filter` function.

```scala

val bankText = sc.textFile("yourPath/bank/bank-full.csv")

case class Bank(age:Integer, job:String, marital : String, education : String, balance : Integer)

// split each line, filter out header (starts with "age"), and map it into Bank case class
val bank = bankText.map(s=>s.split(";")).filter(s=>s(0)!="\"age\"").map(
s=>Bank(s(0).toInt,
s(1).replaceAll("\"", ""),
s(2).replaceAll("\"", ""),
s(3).replaceAll("\"", ""),
s(5).replaceAll("\"", "").toInt
)
)

// convert to DataFrame and create temporal table
bank.toDF().registerTempTable("bank")
```
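
An optional step that is not in the original tutorial, but can save debugging time: inspect the parsed data before querying it, to confirm the header row was filtered out and the columns were mapped as intended. A minimal sketch using the `bank` RDD defined above:

```scala
// Convert once and look at the inferred schema and a few parsed rows.
val bankDF = bank.toDF()
bankDF.printSchema()                      // expected columns: age, job, marital, education, balance
bankDF.show(5)                            // first five parsed rows
println("total rows: " + bankDF.count())  // row count after the header filter
```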

### Data Retrieval

Suppose we want to see age distribution from `bank`. To do this, run:

```sql
%sql select age, count(1) from bank where age < 30 group by age order by age
```

You can make input box for setting age condition by replacing `30` with `${maxAge=30}`.

```sql
%sql select age, count(1) from bank where age < ${maxAge=30} group by age order by age
```

Now we want to see age distribution with certain marital status and add combo box to select marital status. Run:

```sql
%sql select age, count(1) from bank where marital="${marital=single,single|divorced|married}" group by age order by age
```

<br />
## Tutorial with Streaming Data

### Data Refine

Since this tutorial is based on Twitter's sample tweet stream, you must configure authentication with a Twitter account. To do this, take a look at [Twitter Credential Setup](https://databricks-training.s3.amazonaws.com/realtime-processing-with-spark-streaming.html#twitter-credential-setup). After you get API keys, you should fill out credential related values(`apiKey`, `apiSecret`, `accessToken`, `accessTokenSecret`) with your API keys on following script.

This will create a RDD of `Tweet` objects and register these stream data as a table:

```scala
import org.apache.spark.streaming._
import org.apache.spark.streaming.twitter._
import org.apache.spark.storage.StorageLevel
import scala.io.Source
import scala.collection.mutable.HashMap
import java.io.File
import org.apache.log4j.Logger
import org.apache.log4j.Level
import sys.process.stringSeqToProcess

/** Configures the Oauth Credentials for accessing Twitter */
def configureTwitterCredentials(apiKey: String, apiSecret: String, accessToken: String, accessTokenSecret: String) {
val configs = new HashMap[String, String] ++= Seq(
"apiKey" -> apiKey, "apiSecret" -> apiSecret, "accessToken" -> accessToken, "accessTokenSecret" -> accessTokenSecret)
println("Configuring Twitter OAuth")
configs.foreach{ case(key, value) =>
if (value.trim.isEmpty) {
throw new Exception("Error setting authentication - value for " + key + " not set")
}
val fullKey = "twitter4j.oauth." + key.replace("api", "consumer")
System.setProperty(fullKey, value.trim)
println("\tProperty " + fullKey + " set as [" + value.trim + "]")
}
println()
}

// Configure Twitter credentials
val apiKey = "xxxxxxxxxxxxxxxxxxxxxxxxx"
val apiSecret = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
val accessToken = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
val accessTokenSecret = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
configureTwitterCredentials(apiKey, apiSecret, accessToken, accessTokenSecret)

import org.apache.spark.streaming.twitter._
val ssc = new StreamingContext(sc, Seconds(2))
val tweets = TwitterUtils.createStream(ssc, None)
val twt = tweets.window(Seconds(60))

case class Tweet(createdAt:Long, text:String)
twt.map(status=>
Tweet(status.getCreatedAt().getTime()/1000, status.getText())
).foreachRDD(rdd=>
// Below line works only in spark 1.3.0.
// For spark 1.1.x and spark 1.2.x,
// use rdd.registerTempTable("tweets") instead.
rdd.toDF().registerAsTable("tweets")
)

twt.print

ssc.start()
```
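
One detail the tutorial does not cover is how to stop the stream once you are done experimenting. A hedged sketch: `StreamingContext.stop` accepts a flag that controls whether the underlying SparkContext is shut down as well, and in Zeppelin you normally want to keep that context alive so other paragraphs continue to work:

```scala
// Stop the streaming computation but keep the SparkContext,
// so the rest of the notebook keeps working.
ssc.stop(stopSparkContext = false)
```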

### Data Retrieval

For each following script, every time you click run button you will see different result since it is based on real-time data.

Let's begin by extracting maximum 10 tweets which contain the word **girl**.

```sql
%sql select * from tweets where text like '%girl%' limit 10
```

This time suppose we want to see how many tweets have been created per sec during last 60 sec. To do this, run:

```sql
%sql select createdAt, count(1) from tweets group by createdAt order by createdAt
```


You can make user-defined function and use it in Spark SQL. Let's try it by making function named `sentiment`. This function will return one of the three attitudes( positive, negative, neutral ) towards the parameter.

```scala
def sentiment(s:String) : String = {
val positive = Array("like", "love", "good", "great", "happy", "cool", "the", "one", "that")
val negative = Array("hate", "bad", "stupid", "is")

var st = 0;

val words = s.split(" ")
positive.foreach(p =>
words.foreach(w =>
if(p==w) st = st+1
)
)

negative.foreach(p=>
words.foreach(w=>
if(p==w) st = st-1
)
)
if(st>0)
"positive"
else if(st<0)
"negative"
else
"neutral"
}

// Below line works only in spark 1.3.0.
// For spark 1.1.x and spark 1.2.x,
// use sqlc.registerFunction("sentiment", sentiment _) instead.
sqlc.udf.register("sentiment", sentiment _)

```
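
Before wiring the function into SQL, it can be reassuring to call it directly from Scala and check a few inputs against the word lists defined above; a quick sketch (not part of the original tutorial):

```scala
// Simple spot checks against the positive/negative word lists above.
println(sentiment("love it"))  // "positive" ("love" is in the positive list)
println(sentiment("hate it"))  // "negative" ("hate" is in the negative list)
println(sentiment("hello"))    // "neutral"  (no listed words)
```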

To check how people think about girls using `sentiment` function we've made above, run this:

```sql
%sql select sentiment(text), count(1) from tweets where text like '%girl%' group by sentiment(text)
```