Translate Zeppelin Tutorial Document to Korean #10
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -19,29 +19,29 @@ limitations under the License. | |
--> | ||
{% include JB/setup %} | ||
|
||
# Zeppelin Tutorial | ||
# 제플린 튜토리얼 | ||
|
||
<div id="toc"></div> | ||
|
||
This tutorial walks you through some of the fundamental Zeppelin concepts. We will assume you have already installed Zeppelin. If not, please see [here](../install/install.html) first. | ||
이 튜토리얼은 핵심 제플린 개념의 일부를 소개합니다. 튜토리얼에 들어가기 전에 제플린을 먼저 설치해야 합니다. 아직 설치하지 않았다면 [이 곳](../install/install.html)을 먼저 참조합니다. | ||
|
||
Current main backend processing engine of Zeppelin is [Apache Spark](https://spark.apache.org). If you're new to this system, you might want to start by getting an idea of how it processes data to get the most out of Zeppelin. | ||
현재 제플린의 주요 백엔드 처리 엔진은 [Apache Spark](https://spark.apache.org)입니다. 이 시스템이 처음이라면, 제플린을 최대한 활용하기 위해 스파크가 데이터를 어떻게 처리하는지 먼저 알아보고 시작하는 것이 좋습니다. | ||
|
||
## Tutorial with Local File | ||
## 로컬 파일을 이용한 튜토리얼 | ||
|
||
### Data Refine | ||
### 데이터 정제 | ||
|
||
Before you start Zeppelin tutorial, you will need to download [bank.zip](http://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip). | ||
튜토리얼을 시작하기 전에, [bank.zip](http://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip)을 먼저 다운로드 받습니다. | ||
|
||
First, to transform csv format data into RDD of `Bank` objects, run following script. This will also remove header using `filter` function. | ||
우선, csv 형식 데이터를 `Bank` 객체의 RDD로 변환하기 위해 아래 스크립트를 실행합니다. 또한 `filter` 함수를 사용해서 헤더를 제거합니다. | ||
|
||
```scala | ||
|
||
val bankText = sc.textFile("yourPath/bank/bank-full.csv") | ||
|
||
case class Bank(age:Integer, job:String, marital : String, education : String, balance : Integer) | ||
|
||
// split each line, filter out header (starts with "age"), and map it into Bank case class | ||
// 각 라인을 분리하여 "age"로 시작하는 헤더를 걸러내고, 'Bank' case class로 매핑합니다. | ||
val bank = bankText.map(s=>s.split(";")).filter(s=>s(0)!="\"age\"").map( | ||
s=>Bank(s(0).toInt, | ||
s(1).replaceAll("\"", ""), | ||
|
@@ -51,38 +51,37 @@ val bank = bankText.map(s=>s.split(";")).filter(s=>s(0)!="\"age\"").map( | |
) | ||
) | ||
|
||
// convert to DataFrame and create temporal table | ||
// DataFrame으로 변환하고 임시 테이블을 생성합니다. | ||
bank.toDF().registerTempTable("bank") | ||
``` | ||
|
||
### Data Retrieval | ||
### 데이터 검색 | ||
|
||
Suppose we want to see age distribution from `bank`. To do this, run: | ||
bank의 나이 분포를 확인하려면, 아래를 실행합니다. | ||
|
||
```sql | ||
%sql select age, count(1) from bank where age < 30 group by age order by age | ||
``` | ||
|
||
You can make input box for setting age condition by replacing `30` with `${maxAge=30}`. | ||
`30`을 `${maxAge=30}`으로 대체해서 나이 조건을 설정하는 입력 상자를 만들 수 있습니다. | ||
|
||
```sql | ||
%sql select age, count(1) from bank where age < ${maxAge=30} group by age order by age | ||
``` | ||
|
||
Now we want to see age distribution with certain marital status and add combo box to select marital status. Run: | ||
혼인 여부를 포함한 나이 분포를 확인하고, 혼인 여부를 선택할 선택 박스를 추가하려면, 아래를 실행합니다. | ||
Reviewer comment: there was a typo at the end of the last sentence.
||
|
||
```sql | ||
%sql select age, count(1) from bank where marital="${marital=single,single|divorced|married}" group by age order by age | ||
``` | ||
|
||
<br /> | ||
## Tutorial with Streaming Data | ||
## 스트리밍 데이터를 이용한 튜토리얼 | ||
|
||
### Data Refine | ||
### 데이터 정제 | ||
|
||
Since this tutorial is based on Twitter's sample tweet stream, you must configure authentication with a Twitter account. To do this, take a look at [Twitter Credential Setup](https://databricks-training.s3.amazonaws.com/realtime-processing-with-spark-streaming.html#twitter-credential-setup). After you get API keys, you should fill out credential related values(`apiKey`, `apiSecret`, `accessToken`, `accessTokenSecret`) with your API keys on following script. | ||
|
||
This will create a RDD of `Tweet` objects and register these stream data as a table: | ||
이 튜토리얼은 트위터의 샘플 트윗 스트림을 기반으로 하기 때문에, 트위터 계정으로 인증이 되어야 합니다. 인증하기 위해, [Twitter Credential Setup](https://databricks-training.s3.amazonaws.com/realtime-processing-with-spark-streaming.html#twitter-credential-setup)을 참조합니다. API 키를 받은 후, 아래 스크립트에 자격 증명 관련 값(`apiKey`, `apiSecret`, `accessToken`, `accessTokenSecret`)을 API 키로 채워야 합니다. | ||
아래 스크립트는 Tweet 객체의 RDD를 생성하고, 스트림 데이터를 테이블로 등록합니다. | ||
|
||
```scala | ||
import org.apache.spark.streaming._ | ||
|
@@ -95,7 +94,7 @@ import org.apache.log4j.Logger | |
import org.apache.log4j.Level | ||
import sys.process.stringSeqToProcess | ||
|
||
/** Configures the Oauth Credentials for accessing Twitter */ | ||
/** 트위터에 접근하기 위한 Oauth 자격 증명을 구성 */ | ||
def configureTwitterCredentials(apiKey: String, apiSecret: String, accessToken: String, accessTokenSecret: String) { | ||
val configs = new HashMap[String, String] ++= Seq( | ||
"apiKey" -> apiKey, "apiSecret" -> apiSecret, "accessToken" -> accessToken, "accessTokenSecret" -> accessTokenSecret) | ||
|
@@ -111,7 +110,7 @@ def configureTwitterCredentials(apiKey: String, apiSecret: String, accessToken: | |
println() | ||
} | ||
|
||
// Configure Twitter credentials | ||
// 트위터 자격증명 구성 | ||
val apiKey = "xxxxxxxxxxxxxxxxxxxxxxxxx" | ||
val apiSecret = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" | ||
val accessToken = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" | ||
|
@@ -127,9 +126,9 @@ case class Tweet(createdAt:Long, text:String) | |
twt.map(status=> | ||
Tweet(status.getCreatedAt().getTime()/1000, status.getText()) | ||
).foreachRDD(rdd=> | ||
// Below line works only in spark 1.3.0. | ||
// For spark 1.1.x and spark 1.2.x, | ||
// use rdd.registerTempTable("tweets") instead. | ||
// 아래 코드는 spark 1.3.0에서만 작동합니다. | ||
Reviewer comment: spacing issue in the line above.
||
// spark 1.1.x and spark 1.2.x 에서는 | ||
// rdd.registerTempTable("tweets")을 사용해야 합니다. | ||
rdd.toDF().registerAsTable("tweets") | ||
) | ||
|
||
|
@@ -138,24 +137,24 @@ twt.print | |
ssc.start() | ||
``` | ||
|
||
### Data Retrieval | ||
### 데이터 검색 | ||
|
||
For each following script, every time you click run button you will see different result since it is based on real-time data. | ||
아래 각 스크립트는 실시간 데이터를 기반으로 하기 때문에 실행 버튼을 클릭할 때마다 다른 결과값을 출력합니다. | ||
Reviewer comment: spelling/spacing error in the sentence above.
||
|
||
Let's begin by extracting maximum 10 tweets which contain the word **girl**. | ||
단어 **girl**을 포함하는 최대 10개의 트윗을 추출해봅시다. | ||
|
||
```sql | ||
%sql select * from tweets where text like '%girl%' limit 10 | ||
``` | ||
|
||
This time suppose we want to see how many tweets have been created per sec during last 60 sec. To do this, run: | ||
지난 60초 동안 초당 얼마나 많은 트윗이 생성되었는지 확인해봅시다. | ||
|
||
```sql | ||
%sql select createdAt, count(1) from tweets group by createdAt order by createdAt | ||
``` | ||
|
||
|
||
You can make user-defined function and use it in Spark SQL. Let's try it by making function named `sentiment`. This function will return one of the three attitudes( positive, negative, neutral ) towards the parameter. | ||
또한, 사용자 정의 함수를 만들어서 스파크 SQL에서 사용할 수도 있습니다. `sentiment`라는 함수를 만들어서 연습해봅시다. 이 함수는 파라미터에 대하여 세 가지 속성(긍정, 부정, 중립) 중 하나를 반환합니다. | ||
|
||
```scala | ||
def sentiment(s:String) : String = { | ||
|
@@ -184,14 +183,14 @@ def sentiment(s:String) : String = { | |
"neutral" | ||
} | ||
|
||
// Below line works only in spark 1.3.0. | ||
// For spark 1.1.x and spark 1.2.x, | ||
// use sqlc.registerFunction("sentiment", sentiment _) instead. | ||
// 아래 코드는 spark 1.3.0에서만 작동합니다. | ||
// spark 1.1.x and spark 1.2.x 에서는 | ||
// sqlc.registerFunction("sentiment", sentiment _)을 사용해야 합니다. | ||
sqlc.udf.register("sentiment", sentiment _) | ||
|
||
``` | ||
|
||
To check how people think about girls using `sentiment` function we've made above, run this: | ||
위에서 만든 `sentiment` 함수를 사용하여 사람들이 'girl'에 대해 어떻게 생각하는지 확인하기 위해 아래를 실행합니다. | ||
|
||
```sql | ||
%sql select sentiment(text), count(1) from tweets where text like '%girl%' group by sentiment(text) | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,198 @@ | ||
--- | ||
layout: page | ||
title: "Apache Zeppelin Tutorial" | ||
description: "This tutorial page contains a short walk-through tutorial that uses Apache Spark backend. Please note that this tutorial is valid for Spark 1.3 and higher." | ||
group: quickstart | ||
--- | ||
<!-- | ||
Licensed under the Apache License, Version 2.0 (the "License"); | ||
you may not use this file except in compliance with the License. | ||
You may obtain a copy of the License at | ||
|
||
http://www.apache.org/licenses/LICENSE-2.0 | ||
|
||
Unless required by applicable law or agreed to in writing, software | ||
distributed under the License is distributed on an "AS IS" BASIS, | ||
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
See the License for the specific language governing permissions and | ||
limitations under the License. | ||
--> | ||
{% include JB/setup %} | ||
|
||
# Zeppelin Tutorial | ||
|
||
<div id="toc"></div> | ||
|
||
This tutorial walks you through some of the fundamental Zeppelin concepts. We will assume you have already installed Zeppelin. If not, please see [here](../install/install.html) first. | ||
|
||
Current main backend processing engine of Zeppelin is [Apache Spark](https://spark.apache.org). If you're new to this system, you might want to start by getting an idea of how it processes data to get the most out of Zeppelin. | ||
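
If Spark is new to you, a quick way to get a feel for how it processes data is to run a trivial transformation in a notebook paragraph before starting the tutorial proper. The sketch below assumes Zeppelin's default Spark interpreter, where a `SparkContext` is already bound to `sc`:

```scala
// Minimal sketch (not part of the tutorial dataset): build a small RDD, transform it, collect it.
val numbers = sc.parallelize(1 to 10)      // distribute a local range as an RDD
val doubled = numbers.map(_ * 2)           // transformations are lazy
println(doubled.collect().mkString(", "))  // collect() is an action that triggers the computation
```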
|
||
## Tutorial with Local File | ||
|
||
### Data Refine | ||
|
||
Before you start Zeppelin tutorial, you will need to download [bank.zip](http://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip). | ||
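
If you would rather fetch and unpack the archive from a notebook paragraph instead of downloading it by hand, a sketch along these lines can work. It assumes `wget` and `unzip` are installed on the host running the Spark interpreter, and the `/tmp` paths are only examples; adjust them to your environment.

```scala
import sys.process._

// Hypothetical convenience step: download and extract bank.zip on the interpreter host.
// Assumes wget and unzip are available; /tmp is just an example target directory.
Seq("wget", "-q", "http://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip", "-P", "/tmp").!
Seq("unzip", "-o", "/tmp/bank.zip", "-d", "/tmp/bank").!
```

With that layout, the file used in the next script would be at `/tmp/bank/bank-full.csv`.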
|
||
First, to transform csv format data into RDD of `Bank` objects, run following script. This will also remove header using `filter` function. | ||
|
||
```scala | ||
|
||
val bankText = sc.textFile("yourPath/bank/bank-full.csv") | ||
|
||
case class Bank(age:Integer, job:String, marital : String, education : String, balance : Integer) | ||
|
||
// split each line, filter out header (starts with "age"), and map it into Bank case class | ||
val bank = bankText.map(s=>s.split(";")).filter(s=>s(0)!="\"age\"").map( | ||
s=>Bank(s(0).toInt, | ||
s(1).replaceAll("\"", ""), | ||
s(2).replaceAll("\"", ""), | ||
s(3).replaceAll("\"", ""), | ||
s(5).replaceAll("\"", "").toInt | ||
) | ||
) | ||
|
||
// convert to DataFrame and create temporal table | ||
bank.toDF().registerTempTable("bank") | ||
``` | ||
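
Before switching to the `%sql` paragraphs below, it can help to sanity-check the registered table from the same Scala paragraph. This is only a quick verification sketch; it relies on the SQLContext that the tutorial later refers to as `sqlc`:

```scala
// Quick check that the temp table is registered and queryable.
sqlc.sql("select * from bank limit 5").show()
sqlc.sql("select count(1) from bank").show()
```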
|
||
### Data Retrieval | ||
|
||
Suppose we want to see age distribution from `bank`. To do this, run: | ||
|
||
```sql | ||
%sql select age, count(1) from bank where age < 30 group by age order by age | ||
``` | ||
|
||
You can make input box for setting age condition by replacing `30` with `${maxAge=30}`. | ||
|
||
```sql | ||
%sql select age, count(1) from bank where age < ${maxAge=30} group by age order by age | ||
``` | ||
|
||
Now we want to see age distribution with certain marital status and add combo box to select marital status. Run: | ||
|
||
```sql | ||
%sql select age, count(1) from bank where marital="${marital=single,single|divorced|married}" group by age order by age | ||
``` | ||
|
||
<br /> | ||
## Tutorial with Streaming Data | ||
|
||
### Data Refine | ||
|
||
Since this tutorial is based on Twitter's sample tweet stream, you must configure authentication with a Twitter account. To do this, take a look at [Twitter Credential Setup](https://databricks-training.s3.amazonaws.com/realtime-processing-with-spark-streaming.html#twitter-credential-setup). After you get API keys, you should fill out credential related values(`apiKey`, `apiSecret`, `accessToken`, `accessTokenSecret`) with your API keys on following script. | ||
|
||
This will create a RDD of `Tweet` objects and register these stream data as a table: | ||
|
||
```scala | ||
import org.apache.spark.streaming._ | ||
import org.apache.spark.streaming.twitter._ | ||
import org.apache.spark.storage.StorageLevel | ||
import scala.io.Source | ||
import scala.collection.mutable.HashMap | ||
import java.io.File | ||
import org.apache.log4j.Logger | ||
import org.apache.log4j.Level | ||
import sys.process.stringSeqToProcess | ||
|
||
/** Configures the Oauth Credentials for accessing Twitter */ | ||
def configureTwitterCredentials(apiKey: String, apiSecret: String, accessToken: String, accessTokenSecret: String) { | ||
val configs = new HashMap[String, String] ++= Seq( | ||
"apiKey" -> apiKey, "apiSecret" -> apiSecret, "accessToken" -> accessToken, "accessTokenSecret" -> accessTokenSecret) | ||
println("Configuring Twitter OAuth") | ||
configs.foreach{ case(key, value) => | ||
if (value.trim.isEmpty) { | ||
throw new Exception("Error setting authentication - value for " + key + " not set") | ||
} | ||
val fullKey = "twitter4j.oauth." + key.replace("api", "consumer") | ||
System.setProperty(fullKey, value.trim) | ||
println("\tProperty " + fullKey + " set as [" + value.trim + "]") | ||
} | ||
println() | ||
} | ||
|
||
// Configure Twitter credentials | ||
val apiKey = "xxxxxxxxxxxxxxxxxxxxxxxxx" | ||
val apiSecret = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" | ||
val accessToken = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" | ||
val accessTokenSecret = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" | ||
configureTwitterCredentials(apiKey, apiSecret, accessToken, accessTokenSecret) | ||
|
||
import org.apache.spark.streaming.twitter._ | ||
val ssc = new StreamingContext(sc, Seconds(2)) | ||
val tweets = TwitterUtils.createStream(ssc, None) | ||
val twt = tweets.window(Seconds(60)) | ||
|
||
case class Tweet(createdAt:Long, text:String) | ||
twt.map(status=> | ||
Tweet(status.getCreatedAt().getTime()/1000, status.getText()) | ||
).foreachRDD(rdd=> | ||
// Below line works only in spark 1.3.0. | ||
// For spark 1.1.x and spark 1.2.x, | ||
// use rdd.registerTempTable("tweets") instead. | ||
rdd.toDF().registerAsTable("tweets") | ||
) | ||
|
||
twt.print | ||
|
||
ssc.start() | ||
``` | ||
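
Note that `ssc.start()` leaves a streaming job running in the background for as long as the interpreter lives. When you are done with the streaming part of the tutorial, you will probably want to stop it without tearing down the shared SparkContext; a minimal cleanup sketch:

```scala
// Stop the streaming computation but keep the SparkContext alive for the other paragraphs.
ssc.stop(stopSparkContext = false)
```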
|
||
### Data Retrieval | ||
|
||
For each following script, every time you click run button you will see different result since it is based on real-time data. | ||
|
||
Let's begin by extracting maximum 10 tweets which contain the word **girl**. | ||
|
||
```sql | ||
%sql select * from tweets where text like '%girl%' limit 10 | ||
``` | ||
|
||
This time suppose we want to see how many tweets have been created per sec during last 60 sec. To do this, run: | ||
|
||
```sql | ||
%sql select createdAt, count(1) from tweets group by createdAt order by createdAt | ||
``` | ||
|
||
|
||
You can make user-defined function and use it in Spark SQL. Let's try it by making function named `sentiment`. This function will return one of the three attitudes( positive, negative, neutral ) towards the parameter. | ||
|
||
```scala | ||
def sentiment(s:String) : String = { | ||
val positive = Array("like", "love", "good", "great", "happy", "cool", "the", "one", "that") | ||
val negative = Array("hate", "bad", "stupid", "is") | ||
|
||
var st = 0; | ||
|
||
val words = s.split(" ") | ||
positive.foreach(p => | ||
words.foreach(w => | ||
if(p==w) st = st+1 | ||
) | ||
) | ||
|
||
negative.foreach(p=> | ||
words.foreach(w=> | ||
if(p==w) st = st-1 | ||
) | ||
) | ||
if(st>0) | ||
"positivie" | ||
else if(st<0) | ||
"negative" | ||
else | ||
"neutral" | ||
} | ||
|
||
// Below line works only in spark 1.3.0. | ||
// For spark 1.1.x and spark 1.2.x, | ||
// use sqlc.registerFunction("sentiment", sentiment _) instead. | ||
sqlc.udf.register("sentiment", sentiment _) | ||
|
||
``` | ||
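
Before calling `sentiment` from SQL, you can exercise it directly in the same Scala paragraph to see how the word lists score a few strings. The example inputs below are arbitrary:

```scala
// "love" is in the positive list and "hate" in the negative list, so:
println(sentiment("i love this girl"))  // positive
println(sentiment("i hate mondays"))    // negative
println(sentiment(""))                  // no matching words -> neutral
```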
|
||
To check how people think about girls using `sentiment` function we've made above, run this: | ||
|
||
```sql | ||
%sql select sentiment(text), count(1) from tweets where text like '%girl%' group by sentiment(text) | ||
``` |
Reviewer comment: please keep the sentence style consistent: 실행한다. → 실행합니다.