Batch download implementation (#125)
* [Feat] Add batch download of all documents in a folder (#121)

* Initial support

* Basically working

* Use concurrency

* Clean up

* Integrate batch download into download functionality

* Whoops

* refactor: download documents in batch

* format: tidy the code

* update: batch download guideline

---------

Co-authored-by: Jacket <[email protected]>
Wsine and PRESIDENT810 authored Jul 8, 2024
1 parent 86373fe commit da22642
Showing 15 changed files with 244 additions and 79 deletions.
47 changes: 37 additions & 10 deletions README.md
@@ -1,4 +1,4 @@
-# Feishu2Md
+# feishu2md

 [![Golang - feishu2md](https://img.shields.io/github/go-mod/go-version/wsine/feishu2md?color=%2376e1fe&logo=go)](https://go.dev/)
 [![Unittest](https://github.com/Wsine/feishu2md/actions/workflows/unittest.yaml/badge.svg)](https://github.com/Wsine/feishu2md/actions/workflows/unittest.yaml)
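@@ -20,13 +20,13 @@
 The configuration file requires an APP ID and APP SECRET; see the [official Feishu documentation](https://open.feishu.cn/document/ukTMukTMukTM/ukDNz4SO0MjL5QzM/get-) for how to obtain them. Recommended setup:

 - Open the Feishu [developer console](https://open.feishu.cn/app)
-- Create a custom enterprise app; fill in the details however you like
-- Select the test enterprise and members, create a test enterprise, bind the app, and switch to the test version
-- (Important) Open Permission Management, go to Cloud Docs, and enable all read-only permissions
-  - "View, comment on, and export documents" permission `docs:doc:readonly`
-  - "View DocX documents" permission `docx:document:readonly`
-  - "View, comment on, and download all files in cloud space" permission `drive:drive:readonly`
-  - "View and download files in cloud space" permission `drive:file:readonly`
+- Create a custom enterprise app (personal edition); fill in the details however you like
+- (Important) Open Permission Management and enable the following required permissions (follow each link and check the permission configuration field in the API debugging console)
+  - [Get basic document information](https://open.feishu.cn/document/server-docs/docs/docs/docx-v1/document/get), "View new-version documents" permission `docx:document:readonly`
+  - [Get all blocks of a document](https://open.feishu.cn/document/server-docs/docs/docs/docx-v1/document/list), "View new-version documents" permission `docx:document:readonly`
+  - [Download media](https://open.feishu.cn/document/server-docs/docs/drive-v1/media/download), "Download images and attachments in cloud documents" permission `docs:document.media:download`
+  - [List files in a folder](https://open.feishu.cn/document/server-docs/docs/drive-v1/folder/list), "View, comment on, edit, and manage all files in cloud space" permission `drive:file:readonly`
+  - [Get knowledge space node information](https://open.feishu.cn/document/server-docs/docs/wiki-v2/space-node/get_node), "View wiki" permission `wiki:wiki:readonly`
 - Open Credentials & Basic Info and obtain the App ID and App Secret

 ## How to use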
@@ -71,6 +71,20 @@
    --appId value      Set app id for the OPEN API
    --appSecret value  Set app secret for the OPEN API
    --help, -h         show help (default: false)
+
+$ feishu2md dl -h
+NAME:
+   feishu2md download - Download feishu/larksuite document to markdown file
+
+USAGE:
+   feishu2md download [command options] <url>
+
+OPTIONS:
+   --output value, -o value  Specify the output directory for the markdown files (default: "./")
+   --dump                    Dump json response of the OPEN API (default: false)
+   --batch                   Download all documents under a folder (default: false)
+   --help, -h                show help (default: false)
+
 ```
 **Generate the configuration file**
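@@ -81,15 +95,28 @@
 For more configuration options, open the configuration file and edit it manually.
-**Download as Markdown**
+**Download a single document as Markdown**
-Download directly with `feishu2md dl <your feishu docx url>`. The document link can be obtained via **Share > Enable link sharing > Copy link**.
+Download directly with `feishu2md dl <your feishu docx url>`. The document link can be obtained via **Share > Enable link sharing > Anyone on the internet with the link can view > Copy link**.
 Example:
 ```bash
 $ feishu2md dl "https://domain.feishu.cn/docx/docxtoken"
 ```
+**Batch download all documents in a folder as Markdown**
+This feature is not yet supported in the Docker version.
+Download directly with `feishu2md dl --batch <your feishu folder url>`. The folder link can be obtained via **Share > Enable link sharing > Anyone on the internet with the link can view > Copy link**.
+Example:
+```bash
+$ feishu2md dl --batch -o output_directory "https://domain.feishu.cn/drive/folder/foldertoken"
+```
 </details>
 <details>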
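Reviewer note (not part of the commit): batch mode expects a folder link shaped like the README example. The project's `utils.ValidateFolderURL` is not among the lines shown here, so the following is only a minimal sketch of how such a link could be checked, assuming folder links always have the path `/drive/folder/<token>`. The name `folderTokenFromURL` is hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// folderTokenFromURL is a hypothetical helper, not the project's
// utils.ValidateFolderURL. It assumes folder links look like
// https://<domain>/drive/folder/<token>, as in the README example.
func folderTokenFromURL(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", fmt.Errorf("parse url: %w", err)
	}
	const prefix = "/drive/folder/"
	if !strings.HasPrefix(u.Path, prefix) {
		return "", fmt.Errorf("not a folder url: %q", raw)
	}
	// Drop the prefix and any trailing slash; what remains must be one segment.
	token := strings.Trim(strings.TrimPrefix(u.Path, prefix), "/")
	if token == "" || strings.Contains(token, "/") {
		return "", fmt.Errorf("missing or malformed folder token in %q", raw)
	}
	return token, nil
}

func main() {
	token, err := folderTokenFromURL("https://domain.feishu.cn/drive/folder/foldertoken")
	fmt.Println(token, err)
}
```

A document link such as `https://domain.feishu.cn/docx/docxtoken` is rejected by this sketch, which is the kind of early failure one would want before any API call is made.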
18 changes: 10 additions & 8 deletions cmd/config.go
@@ -15,13 +15,15 @@ type ConfigOpts struct {

 var configOpts = ConfigOpts{}

-func handleConfigCommand(opts *ConfigOpts) error {
+func handleConfigCommand() error {
 	configPath, err := core.GetConfigFilePath()
-	utils.CheckErr(err)
+	if err != nil {
+		return err
+	}

 	fmt.Println("Configuration file on: " + configPath)
 	if _, err := os.Stat(configPath); os.IsNotExist(err) {
-		config := core.NewConfig(opts.appId, opts.appSecret)
+		config := core.NewConfig(configOpts.appId, configOpts.appSecret)
 		if err = config.WriteConfig2File(configPath); err != nil {
 			return err
 		}
@@ -31,13 +33,13 @@ func handleConfigCommand(opts *ConfigOpts) error {
 		if err != nil {
 			return err
 		}
-		if opts.appId != "" {
-			config.Feishu.AppId = opts.appId
+		if configOpts.appId != "" {
+			config.Feishu.AppId = configOpts.appId
 		}
-		if opts.appSecret != "" {
-			config.Feishu.AppSecret = opts.appSecret
+		if configOpts.appSecret != "" {
+			config.Feishu.AppSecret = configOpts.appSecret
 		}
-		if opts.appId != "" || opts.appSecret != "" {
+		if configOpts.appId != "" || configOpts.appSecret != "" {
 			if err = config.WriteConfig2File(configPath); err != nil {
 				return err
 			}
122 changes: 98 additions & 24 deletions cmd/download.go
@@ -6,6 +6,7 @@ import (
 	"os"
 	"path/filepath"
 	"strings"
+	"sync"

 	"github.com/88250/lute"
 	"github.com/Wsine/feishu2md/core"
@@ -17,29 +18,20 @@ import (
 type DownloadOpts struct {
 	outputDir string
 	dump      bool
+	batch     bool
 }

-var downloadOpts = DownloadOpts{}
+var dlOpts = DownloadOpts{}
+var dlConfig core.Config

-func handleDownloadCommand(url string, opts *DownloadOpts) error {
+func downloadDocument(client *core.Client, ctx context.Context, url string, opts *DownloadOpts) error {
 	// Validate the url to download
-	docType, docToken, err := utils.ValidateDownloadURL(url)
-	utils.CheckErr(err)
+	docType, docToken, err := utils.ValidateDocumentURL(url)
+	if err != nil {
+		return err
+	}
 	fmt.Println("Captured document token:", docToken)

-	// Load config
-	configPath, err := core.GetConfigFilePath()
-	utils.CheckErr(err)
-	config, err := core.ReadConfigFromFile(configPath)
-	utils.CheckErr(err)
-
-	// Create client with context
-	ctx := context.WithValue(context.Background(), "output", config.Output)
-
-	client := core.NewClient(
-		config.Feishu.AppId, config.Feishu.AppSecret,
-	)
-
 	// for a wiki page, we need to renew docType and docToken first
 	if docType == "wiki" {
 		node, err := client.GetWikiNodeInfo(ctx, docToken)
@@ -48,24 +40,28 @@ func handleDownloadCommand(url string, opts *DownloadOpts) error {
 		docToken = node.ObjToken
 	}
 	if docType == "docs" {
-		return errors.Errorf("Feishu Docs is no longer supported. Please refer to the Readme/Release for v1_support.")
+		return errors.Errorf(
+			`Feishu Docs is no longer supported. ` +
+				`Please refer to the Readme/Release for v1_support.`)
 	}

 	// Process the download
 	docx, blocks, err := client.GetDocxContent(ctx, docToken)
 	utils.CheckErr(err)

-	parser := core.NewParser(ctx)
+	parser := core.NewParser(dlConfig.Output)

 	title := docx.Title
 	markdown := parser.ParseDocxContent(docx, blocks)

-	if !config.Output.SkipImgDownload {
+	if !dlConfig.Output.SkipImgDownload {
 		for _, imgToken := range parser.ImgTokens {
 			localLink, err := client.DownloadImage(
-				ctx, imgToken, filepath.Join(opts.outputDir, config.Output.ImageDir),
+				ctx, imgToken, filepath.Join(opts.outputDir, dlConfig.Output.ImageDir),
 			)
-			utils.CheckErr(err)
+			if utils.CheckErr(err) != nil {
+				return err
+			}
 			markdown = strings.Replace(markdown, imgToken, localLink, 1)
 		}
 	}
@@ -83,7 +79,7 @@ func handleDownloadCommand(url string, opts *DownloadOpts) error {
 		}
 	}

-	if opts.dump {
+	if dlOpts.dump {
 		jsonName := fmt.Sprintf("%s.json", docToken)
 		outputPath := filepath.Join(opts.outputDir, jsonName)
 		data := struct {
@@ -103,7 +99,7 @@ func handleDownloadCommand(url string, opts *DownloadOpts) error {

 	// Write to markdown file
 	mdName := fmt.Sprintf("%s.md", docToken)
-	if config.Output.TitleAsFilename {
+	if dlConfig.Output.TitleAsFilename {
 		mdName = fmt.Sprintf("%s.md", title)
 	}
 	outputPath := filepath.Join(opts.outputDir, mdName)
@@ -114,3 +110,81 @@ func handleDownloadCommand(url string, opts *DownloadOpts) error {

 	return nil
 }
+
+func downloadDocuments(client *core.Client, ctx context.Context, url string) error {
+	// Validate the url to download
+	folderToken, err := utils.ValidateFolderURL(url)
+	if err != nil {
+		return err
+	}
+	fmt.Println("Captured folder token:", folderToken)
+
+	// Error channel and wait group
+	errChan := make(chan error)
+	wg := sync.WaitGroup{}
+
+	// Recursively go through the folder and download the documents
+	var processFolder func(ctx context.Context, folderPath, folderToken string) error
+	processFolder = func(ctx context.Context, folderPath, folderToken string) error {
+		files, err := client.GetDriveFolderFileList(ctx, nil, &folderToken)
+		if err != nil {
+			return err
+		}
+		opts := DownloadOpts{outputDir: folderPath, dump: dlOpts.dump, batch: false}
+		for _, file := range files {
+			if file.Type == "folder" {
+				_folderPath := filepath.Join(folderPath, file.Name)
+				if err := processFolder(ctx, _folderPath, file.Token); err != nil {
+					return err
+				}
+			} else if file.Type == "docx" {
+				// concurrently download the document
+				wg.Add(1)
+				go func(_url string) {
+					if err := downloadDocument(client, ctx, _url, &opts); err != nil {
+						errChan <- err
+					}
+					wg.Done()
+				}(file.URL)
+			}
+		}
+		return nil
+	}
+	if err := processFolder(ctx, dlOpts.outputDir, folderToken); err != nil {
+		return err
+	}
+
+	// Wait for all the downloads to finish
+	go func() {
+		wg.Wait()
+		close(errChan)
+	}()
+	for err := range errChan {
+		return err
+	}
+	return nil
+}
+
+func handleDownloadCommand(url string) error {
+	// Load config
+	configPath, err := core.GetConfigFilePath()
+	if err != nil {
+		return err
+	}
+	dlConfig, err := core.ReadConfigFromFile(configPath)
+	if err != nil {
+		return err
+	}
+
+	// Instantiate the client
+	client := core.NewClient(
+		dlConfig.Feishu.AppId, dlConfig.Feishu.AppSecret,
+	)
+	ctx := context.Background()
+
+	if dlOpts.batch {
+		return downloadDocuments(client, ctx, url)
+	}
+
+	return downloadDocument(client, ctx, url, &dlOpts)
+}
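Reviewer note (not part of the commit): `downloadDocuments` pairs a `sync.WaitGroup` with an unbuffered error channel and returns on the first error it receives, so any other goroutine that also fails stays blocked on its send. One possible variation, sketched below with standard-library code only, gives the channel room for one error per job and collects all of them. The names `runAll` and `failOn` are hypothetical and the jobs are stand-ins for real downloads.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// runAll runs job once per item concurrently and returns every error.
// The channel is buffered with one slot per item, so a failing goroutine
// never blocks on send.
func runAll(items []string, job func(string) error) []error {
	errCh := make(chan error, len(items))
	var wg sync.WaitGroup
	for _, item := range items {
		wg.Add(1)
		go func(it string) {
			defer wg.Done()
			if err := job(it); err != nil {
				errCh <- fmt.Errorf("%s: %w", it, err)
			}
		}(item)
	}
	wg.Wait()
	close(errCh)

	var errs []error
	for err := range errCh {
		errs = append(errs, err)
	}
	return errs
}

// failOn returns a stand-in job that fails only for the named item.
func failOn(name string) func(string) error {
	return func(item string) error {
		if item == name {
			return errors.New("simulated failure")
		}
		return nil
	}
}

func main() {
	errs := runAll([]string{"doc-a", "doc-b", "doc-c"}, failOn("doc-b"))
	fmt.Println(len(errs), "download(s) failed:", errs)
}
```

The trade-off is that this version waits for every job before reporting, whereas the commit's loop reports the first failure as soon as it arrives.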
16 changes: 11 additions & 5 deletions cmd/main.go
@@ -38,7 +38,7 @@ func main() {
 					},
 				},
 				Action: func(ctx *cli.Context) error {
-					return handleConfigCommand(&configOpts)
+					return handleConfigCommand()
 				},
 			},
 			{
@@ -51,22 +51,28 @@ func main() {
 						Aliases:     []string{"o"},
 						Value:       "./",
 						Usage:       "Specify the output directory for the markdown files",
-						Destination: &downloadOpts.outputDir,
+						Destination: &dlOpts.outputDir,
 					},
 					&cli.BoolFlag{
 						Name:        "dump",
 						Value:       false,
 						Usage:       "Dump json response of the OPEN API",
-						Destination: &downloadOpts.dump,
+						Destination: &dlOpts.dump,
 					},
+					&cli.BoolFlag{
+						Name:        "batch",
+						Value:       false,
+						Usage:       "Download all documents under a folder",
+						Destination: &dlOpts.batch,
+					},
 				},
 				ArgsUsage: "<url>",
 				Action: func(ctx *cli.Context) error {
 					if ctx.NArg() == 0 {
-						return cli.Exit("Please specify the document url", 1)
+						return cli.Exit("Please specify the document/folder url", 1)
 					} else {
 						url := ctx.Args().First()
-						return handleDownloadCommand(url, &downloadOpts)
+						return handleDownloadCommand(url)
 					}
 				},
 			},
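Reviewer note (not part of the commit): the new `--batch` flag is a plain boolean that selects between the single-document and folder code paths, and the command still requires one positional URL. The sketch below mirrors that shape with the standard library `flag` package instead of the CLI library used by the project, so it can run without dependencies. The `options` type and `parseArgs` function are hypothetical.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
)

// options mirrors the three download flags plus the positional url.
type options struct {
	outputDir string
	dump      bool
	batch     bool
	url       string
}

// parseArgs parses download-style arguments. Flags must come before the url
// because the standard flag package stops at the first non-flag argument.
func parseArgs(args []string) (options, error) {
	var o options
	fs := flag.NewFlagSet("download", flag.ContinueOnError)
	fs.StringVar(&o.outputDir, "output", "./", "output directory for the markdown files")
	fs.StringVar(&o.outputDir, "o", "./", "shorthand for -output")
	fs.BoolVar(&o.dump, "dump", false, "dump json response of the OPEN API")
	fs.BoolVar(&o.batch, "batch", false, "download all documents under a folder")
	if err := fs.Parse(args); err != nil {
		return o, err
	}
	if fs.NArg() == 0 {
		return o, errors.New("please specify the document/folder url")
	}
	o.url = fs.Arg(0)
	return o, nil
}

func main() {
	o, err := parseArgs([]string{"--batch", "-o", "output_directory",
		"https://domain.feishu.cn/drive/folder/foldertoken"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	if o.batch {
		fmt.Println("would download every document under", o.url, "into", o.outputDir)
	} else {
		fmt.Println("would download the single document", o.url, "into", o.outputDir)
	}
}
```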
26 changes: 26 additions & 0 deletions core/client.go
@@ -10,6 +10,7 @@ import (
 	"time"

 	"github.com/chyroc/lark"
+	"github.com/chyroc/lark_rate_limiter"
 )

 type Client struct {
@@ -21,6 +22,7 @@ func NewClient(appID, appSecret string) *Client {
 		larkClient: lark.New(
 			lark.WithAppCredential(appID, appSecret),
 			lark.WithTimeout(60*time.Second),
+			lark.WithApiMiddleware(lark_rate_limiter.Wait(5, 5)),
 		),
 	}
 }
@@ -104,3 +106,27 @@ func (c *Client) GetWikiNodeInfo(ctx context.Context, token string) (*lark.GetWi
 	}
 	return resp.Node, nil
 }
+
+func (c *Client) GetDriveFolderFileList(ctx context.Context, pageToken *string, folderToken *string) ([]*lark.GetDriveFileListRespFile, error) {
+	resp, _, err := c.larkClient.Drive.GetDriveFileList(ctx, &lark.GetDriveFileListReq{
+		PageSize:    nil,
+		PageToken:   pageToken,
+		FolderToken: folderToken,
+	})
+	if err != nil {
+		return nil, err
+	}
+	files := resp.Files
+	for resp.HasMore {
+		resp, _, err = c.larkClient.Drive.GetDriveFileList(ctx, &lark.GetDriveFileListReq{
+			PageSize:    nil,
+			PageToken:   &resp.NextPageToken,
+			FolderToken: folderToken,
+		})
+		if err != nil {
+			return nil, err
+		}
+		files = append(files, resp.Files...)
+	}
+	return files, nil
+}
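Reviewer note (not part of the commit): `GetDriveFolderFileList` follows `NextPageToken` while `HasMore` is true, issuing the first request separately from the loop. The same pagination can be written as a single loop, as in the sketch below. It uses an in-memory fake instead of the Lark client, and `page`, `fetchFunc`, `listAll` and `fakeFetch` are all hypothetical names.

```go
package main

import "fmt"

// page is a stand-in for one response of a paginated list API.
type page struct {
	Items         []string
	HasMore       bool
	NextPageToken string
}

// fetchFunc returns the page for a token; the empty token means the first page.
type fetchFunc func(token string) (page, error)

// listAll requests pages until HasMore is false and returns all items in order.
func listAll(fetch fetchFunc) ([]string, error) {
	var all []string
	token := ""
	for {
		p, err := fetch(token)
		if err != nil {
			return nil, err
		}
		all = append(all, p.Items...)
		if !p.HasMore {
			return all, nil
		}
		token = p.NextPageToken
	}
}

// fakeFetch serves three fixed pages so the loop can be exercised offline.
func fakeFetch(token string) (page, error) {
	switch token {
	case "":
		return page{Items: []string{"a", "b"}, HasMore: true, NextPageToken: "p2"}, nil
	case "p2":
		return page{Items: []string{"c"}, HasMore: true, NextPageToken: "p3"}, nil
	case "p3":
		return page{Items: []string{"d"}}, nil
	}
	return page{}, fmt.Errorf("unknown page token %q", token)
}

func main() {
	items, err := listAll(fakeFetch)
	fmt.Println(items, err)
}
```

Folding the first request into the loop keeps the request construction in one place, at the cost of a slightly less obvious exit condition.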