The startup of extractus #2
10 comments · 26 replies
-
@SettingDust At first I think it's better to write for Deno first, then build native JS output that works on Node.js. Almost all Node.js packages can run on Deno, while Node.js does not support Deno modules. That's what we should consider (see the rough build sketch below).
It depends on the situation and on some questions to ask yourself.
If that's ok, please draft the structure first.
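As a rough sketch of the Deno-first direction, a build script using dnt (deno.land/x/dnt) could look like this; the entry point and package fields are placeholders, not decided values:

```ts
// scripts/build_npm.ts: build the Deno source into an npm package with dnt
// (a sketch only; mod.ts and the package fields are placeholders)
import { build, emptyDir } from 'https://deno.land/x/dnt/mod.ts'

await emptyDir('./npm')

await build({
  entryPoints: ['./mod.ts'],
  outDir: './npm',
  shims: {
    deno: true, // polyfill Deno globals for the Node.js build
  },
  package: {
    name: '@extractus/article-extractor',
    version: '0.0.1',
  },
})
```

Running `deno run -A scripts/build_npm.ts` would then emit a Node-compatible package under `./npm`.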
-
@SettingDust Thanks. I've cloned your repo to investigate, and found it a bit messy 😄 Let me update the documentation for the repos first. In the meantime, could you sketch out the overall structure you have in mind? To be honest, I still haven't figured out what you mean. I recommend a few criteria:
Even if we use TS, we should focus on the npm ecosystem first, because that is what everyone else is chasing. But I like Deno's built-in test/lint/fmt, so we may not need jest, eslint, standard, etc. (a quick test sketch is below).
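For the tooling point, here is what a test could look like with Deno's built-in runner instead of jest; `extractTitle`, its module path, and the std version are only assumptions for illustration:

```ts
// title-extractor_test.ts: a sketch of using Deno's built-in test runner
// (extractTitle and ./title-extractor.ts are hypothetical)
import { assertEquals } from 'https://deno.land/std@0.170.0/testing/asserts.ts'
import { extractTitle } from './title-extractor.ts'

Deno.test('extractTitle reads the og:title meta tag', () => {
  const html = '<head><meta property="og:title" content="Foo"></head>'
  assertEquals(extractTitle(html), 'Foo')
})
```

With that, `deno test`, `deno lint`, and `deno fmt` cover what jest, eslint, and standard would otherwise do.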
-
It's so difficult to describe what I'm constructing. I've tried many times xdddd. My English is so poor when I try to describe what's in my mind. Coding is so much easier :(

Extractor

Extractors should be passed to … (example for how to implement one: …). The values they extract look like this:

```js
{
  title: ['Foo', 'Bar'],
  url: ['https://foo.bar', 'ftp://example.com/wuwuwu']
}
```

Processor

Processors should be registered in an object like the one below. After getting the values from the extractors, we should deep-merge the extracted objects and feed each field into the corresponding processor:

```js
{
  title: (values) => values.filter(notBlank).map(trim).flatMap(splitWithSeparator),
  url: (values, context) => values.filter(isUrl).map(trim).map(normalize)
}
```

Picker

A picker should pick one value from the processed values:

```js
{
  title: (values) => values[0],
  url: (values, context) => closest(context.title, values) ?? values[0] ?? context.url,
  date: {
    published: (values) => firstParsableDate(values)
  }
}
```
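To make the idea concrete, here is a small sketch of how the three registries could be wired together; the helper names and types (deepMergeCandidates, the flat registry shapes) are only assumptions for illustration, not an agreed API:

```ts
// A sketch only: run every extractor, deep-merge their candidate lists,
// then apply the registered processor and picker for each field.
// (Simplified to one level; nested keys like date.published would need a recursive walk.)
type Candidates = Record<string, string[]>
type Extractor = (input: { html: string; url?: string }) => Candidates
type Processor = (values: string[], context: Record<string, unknown>) => string[]
type Picker = (values: string[], context: Record<string, unknown>) => string | undefined

const deepMergeCandidates = (objects: Candidates[]): Candidates =>
  objects.reduce<Candidates>((merged, current) => {
    for (const [key, values] of Object.entries(current)) {
      merged[key] = [...(merged[key] ?? []), ...values]
    }
    return merged
  }, {})

const run = (
  input: { html: string; url?: string },
  extractors: Extractor[],
  processors: Record<string, Processor>,
  pickers: Record<string, Picker>
) => {
  const candidates = deepMergeCandidates(extractors.map((extract) => extract(input)))
  const context: Record<string, unknown> = { url: input.url }
  const result: Record<string, unknown> = {}
  for (const [key, values] of Object.entries(candidates)) {
    const processed = processors[key] ? processors[key](values, context) : values
    result[key] = pickers[key] ? pickers[key](processed, context) : processed[0]
    context[key] = result[key] // later fields can use earlier picks as context
  }
  return result
}
```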
-
@SettingDust Thank you for your effort. I have updated the information on the repos. Now I will read your ideas carefully.
-
At first I didn't quite understand what you meant, specifically which "core" you were referring to. Now it seems I've got your point. Basically we're not talking about the same thing :)

When I look at the libraries I have developed (…), there is very little in common between them, and none of it is large enough to split into a shared core. So I didn't get what you meant by "core". But after reading your comment and digging through the source code of …, I see that all the business logic in your library runs around an HTML Document. During the process, your extractors come back to this input over and over again. That's why you need memoization (which should be used carefully, with a timeout: imagine you are running an article extractor service with several hundred thousand requests per day; how much RAM will you need?).

Your program design is heavily technical, and the data flow is not easy to follow. This design style seems to be influenced by the old OOP paradigms in Java or Objective-C, which are quite heavy and complex. I prefer something more lightweight, such as function composition in functional programming, or the middleware/pipeline approach in ExpressJS. In …
Maybe I'm not an FP guru yet, but at least it's easy for others to read, understand, and contribute (a tiny composition sketch is below).
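Something like this tiny composition sketch; the stage functions are made up just to show the shape of the code:

```ts
// A sketch only: each stage is a plain function, and the whole flow is
// left-to-right composition, so the data flow reads top to bottom.
const pipe = <T>(...fns: Array<(value: T) => T>) =>
  (value: T): T => fns.reduce((acc, fn) => fn(acc), value)

// Hypothetical stages working on a plain string
const trim = (s: string) => s.trim()
const stripTags = (s: string) => s.replace(/<[^>]*>/g, '')
const collapseSpaces = (s: string) => s.replace(/\s+/g, ' ')

const cleanText = pipe(trim, stripTags, collapseSpaces)

console.log(cleanText('  <p>Hello   world</p>  ')) // "Hello world"
```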
For example, I tend to redesign your lib … Then someone can use it by mixing the built-in methods with their own "plugins":

```js
import { extract, extractArticle, extractAuthor, extractDate } from '@settingdust/article-extractor'

const defaultOptions = {
  sanitize: {...},
  lang: 'en',
}

const f = await fetch(someUrl)
const html = await f.text()

const extractAddress = (input) => {
  return parseAddressFromDom(input.DOM)
}

const result = await extract({
  html,
  url: someUrl,
  ...defaultOptions,
}, [
  extractArticle,
  extractAuthor,
  extractDate,
  extractAddress,
])
```

What do you think about this design?
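For what it's worth, one way such an extract runner could work internally is sketched below; this is only an illustration of the idea above, not the actual API of either library, and the DOM parsing step that a plugin like extractAddress would need is omitted:

```ts
// A sketch only: extract() runs each plugin against the input and
// merges whatever the plugin returns into a single result object.
type ExtractInput = { html: string; url?: string; [key: string]: unknown }
type ExtractResult = Record<string, unknown>
type Plugin = (input: ExtractInput, context: ExtractResult) => Promise<ExtractResult> | ExtractResult

const extract = async (input: ExtractInput, plugins: Plugin[]): Promise<ExtractResult> => {
  const result: ExtractResult = {}
  for (const plugin of plugins) {
    // each plugin sees the raw input plus what earlier plugins produced
    Object.assign(result, await plugin(input, result))
  }
  return result
}
```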
-
Pipeline is good, and what I'm designing handles the data like a pipeline in my mind: input string -> parse to Document/JSON/XML -> [extractor] -> [processor] -> [picker] -> deep-merged object
-
@SettingDust Yes, go ahead and try to refactor your lib that way as an experiment. Add tests and build for multiple platforms.
-
Is there a standard or a lib that can get an element's attribute value with something like a CSS selector?
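With standard DOM APIs (which DOM implementations such as linkedom or deno-dom generally follow), the usual pattern is a CSS attribute selector plus getAttribute; a small sketch, with the selector and helper name assumed for illustration:

```ts
// A sketch only: CSS selectors locate the element, getAttribute() reads the
// value; there is no pure-CSS syntax that returns the attribute value itself.
const getMetaContent = (document: Document, property: string): string | null =>
  document
    .querySelector(`meta[property="${property}"]`)
    ?.getAttribute('content') ?? null

// e.g. getMetaContent(document, 'og:title') returns the og:title value or null
```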
-
Why can't we write the code for Node.js and then compile it for Deno?
-
@SettingDust I found that content from some websites could not be loaded with Deno. Did you face this issue? It seems related to …
-
@ndaidong @extractus/core, @extractus/title-extractor?