The startup of extractus #2
10 comments · 26 replies
-
@SettingDust At first I think it's better to write for Deno first, then build native JS output that works on Node.js. Almost all Node.js packages can run on Deno, while Node.js does not support Deno modules. That's what we should consider (see the rough build sketch below).
It depends on the situation and on some questions to ask yourself.
If that's ok, please draft the structure first.
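As a rough sketch of the Deno-first direction, a build script using dnt (deno.land/x/dnt) could look like this; the entry point and package fields are placeholders, not decided values:

```ts
// scripts/build_npm.ts: build the Deno source into an npm package with dnt
// (a sketch only; mod.ts and the package fields are placeholders)
import { build, emptyDir } from 'https://deno.land/x/dnt/mod.ts'

await emptyDir('./npm')

await build({
  entryPoints: ['./mod.ts'],
  outDir: './npm',
  shims: {
    deno: true, // polyfill Deno globals for the Node.js build
  },
  package: {
    name: '@extractus/article-extractor',
    version: '0.0.1',
  },
})
```

Running `deno run -A scripts/build_npm.ts` would then emit a Node-compatible package under `./npm`.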
-
@SettingDust Thanks. I've cloned your repo to investigate, and found it a bit messy 😄 Let me update the documentation for the repos first. In the meantime, could you sketch out the overall structure you have in mind? To be honest, I still haven't figured out what you mean. I recommend a few criteria:
Even if we use TS, we should focus on the npm ecosystem first, because that is what everyone else is chasing. But I like Deno's built-in test/lint/fmt, so we may not need jest, eslint, standard, etc. (a quick test sketch is below).
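For the tooling point, here is what a test could look like with Deno's built-in runner instead of jest; `extractTitle`, its module path, and the std version are only assumptions for illustration:

```ts
// title-extractor_test.ts: a sketch of using Deno's built-in test runner
// (extractTitle and ./title-extractor.ts are hypothetical)
import { assertEquals } from 'https://deno.land/std@0.170.0/testing/asserts.ts'
import { extractTitle } from './title-extractor.ts'

Deno.test('extractTitle reads the og:title meta tag', () => {
  const html = '<head><meta property="og:title" content="Foo"></head>'
  assertEquals(extractTitle(html), 'Foo')
})
```

With that, `deno test`, `deno lint`, and `deno fmt` cover what jest, eslint, and standard would otherwise do.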
-
It's so difficult to describe what I'm constructing. I've tried many times xdddd. My English is so poor when I try to describe what's in my mind. Coding is so much easier :(

Extractor

Extractors should be passed to … (example for how to implement one: …). The values they extract look like this:

```js
{
  title: ['Foo', 'Bar'],
  url: ['https://foo.bar', 'ftp://example.com/wuwuwu']
}
```

Processor

Processors should be registered in an object like the one below. After getting the values from the extractors, we should deep-merge the extracted objects and feed each field into the corresponding processor:

```js
{
  title: (values) => values.filter(notBlank).map(trim).flatMap(splitWithSeparator),
  url: (values, context) => values.filter(isUrl).map(trim).map(normalize)
}
```

Picker

A picker should pick one value from the processed values:

```js
{
  title: (values) => values[0],
  url: (values, context) => closest(context.title, values) ?? values[0] ?? context.url,
  date: {
    published: (values) => firstParsableDate(values)
  }
}
```
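To make the idea concrete, here is a small sketch of how the three registries could be wired together; the helper names and types (deepMergeCandidates, the flat registry shapes) are only assumptions for illustration, not an agreed API:

```ts
// A sketch only: run every extractor, deep-merge their candidate lists,
// then apply the registered processor and picker for each field.
// (Simplified to one level; nested keys like date.published would need a recursive walk.)
type Candidates = Record<string, string[]>
type Extractor = (input: { html: string; url?: string }) => Candidates
type Processor = (values: string[], context: Record<string, unknown>) => string[]
type Picker = (values: string[], context: Record<string, unknown>) => string | undefined

const deepMergeCandidates = (objects: Candidates[]): Candidates =>
  objects.reduce<Candidates>((merged, current) => {
    for (const [key, values] of Object.entries(current)) {
      merged[key] = [...(merged[key] ?? []), ...values]
    }
    return merged
  }, {})

const run = (
  input: { html: string; url?: string },
  extractors: Extractor[],
  processors: Record<string, Processor>,
  pickers: Record<string, Picker>
) => {
  const candidates = deepMergeCandidates(extractors.map((extract) => extract(input)))
  const context: Record<string, unknown> = { url: input.url }
  const result: Record<string, unknown> = {}
  for (const [key, values] of Object.entries(candidates)) {
    const processed = processors[key] ? processors[key](values, context) : values
    result[key] = pickers[key] ? pickers[key](processed, context) : processed[0]
    context[key] = result[key] // later fields can use earlier picks as context
  }
  return result
}
```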
-
@SettingDust Thank you for your effort. I have updated the information on the repos. Now I will read your ideas carefully.
-
At first I didn't quite understand what you meant, specifically which "core" you were referring to. Now it seems I've got your point. Basically we're not talking about the same thing :)

When I look at the libraries I have developed (…), there is very little in common between them, and none of it is large enough to split into a shared core. So I didn't get what you meant by "core". But after reading your comment and digging through the source code of …, I see that all the business logic in your library runs around an HTML Document. During the process, your extractors come back to this input over and over again. That's why you need memoization (which should be used carefully, with a timeout: imagine you are running an article extractor service with several hundred thousand requests per day; how much RAM will you need?).

Your program design is heavily technical, and the data flow is not easy to follow. This design style seems to be influenced by the old OOP paradigms in Java or Objective-C, which are quite heavy and complex. I prefer something more lightweight, such as function composition in functional programming, or the middleware/pipeline approach in ExpressJS. In …
Maybe I'm not an FP guru yet, but at least it's easy for others to read, understand, and contribute (a tiny composition sketch is below).
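Something like this tiny composition sketch; the stage functions are made up just to show the shape of the code:

```ts
// A sketch only: each stage is a plain function, and the whole flow is
// left-to-right composition, so the data flow reads top to bottom.
const pipe = <T>(...fns: Array<(value: T) => T>) =>
  (value: T): T => fns.reduce((acc, fn) => fn(acc), value)

// Hypothetical stages working on a plain string
const trim = (s: string) => s.trim()
const stripTags = (s: string) => s.replace(/<[^>]*>/g, '')
const collapseSpaces = (s: string) => s.replace(/\s+/g, ' ')

const cleanText = pipe(trim, stripTags, collapseSpaces)

console.log(cleanText('  <p>Hello   world</p>  ')) // "Hello world"
```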
For example, I tend to redesign your lib … Then someone can use it by mixing the built-in methods with their own "plugins":

```js
import { extract, extractArticle, extractAuthor, extractDate } from '@settingdust/article-extractor'

const defaultOptions = {
  sanitize: {...},
  lang: 'en',
}

const f = await fetch(someUrl)
const html = await f.text()

const extractAddress = (input) => {
  return parseAddressFromDom(input.DOM)
}

const result = await extract({
  html,
  url: someUrl,
  ...defaultOptions,
}, [
  extractArticle,
  extractAuthor,
  extractDate,
  extractAddress,
])
```

What do you think about this design?
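For what it's worth, one way such an extract runner could work internally is sketched below; this is only an illustration of the idea above, not the actual API of either library, and the DOM parsing step that a plugin like extractAddress would need is omitted:

```ts
// A sketch only: extract() runs each plugin against the input and
// merges whatever the plugin returns into a single result object.
type ExtractInput = { html: string; url?: string; [key: string]: unknown }
type ExtractResult = Record<string, unknown>
type Plugin = (input: ExtractInput, context: ExtractResult) => Promise<ExtractResult> | ExtractResult

const extract = async (input: ExtractInput, plugins: Plugin[]): Promise<ExtractResult> => {
  const result: ExtractResult = {}
  for (const plugin of plugins) {
    // each plugin sees the raw input plus what earlier plugins produced
    Object.assign(result, await plugin(input, result))
  }
  return result
}
```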
-
Pipeline is good, and what I'm designing handles the data like a pipeline in my mind: input string -> parse to Document/JSON/XML -> [extractor] -> [processor] -> [picker] -> deep-merged object
-
@SettingDust Yes, go ahead and try to refactor your lib that way as an experiment. Add tests and build for multiple platforms.
-
Is there a standard or a lib that can get an element's attribute value with something like a CSS selector?
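With standard DOM APIs (which DOM implementations such as linkedom or deno-dom generally follow), the usual pattern is a CSS attribute selector plus getAttribute; a small sketch, with the selector and helper name assumed for illustration:

```ts
// A sketch only: CSS selectors locate the element, getAttribute() reads the
// value; there is no pure-CSS syntax that returns the attribute value itself.
const getMetaContent = (document: Document, property: string): string | null =>
  document
    .querySelector(`meta[property="${property}"]`)
    ?.getAttribute('content') ?? null

// e.g. getMetaContent(document, 'og:title') returns the og:title value or null
```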
-
Why can't we write the code for Node.js and then compile it for Deno?
-
@SettingDust I found that content from some websites could not be loaded with Deno. Did you face this issue? It seems related to …
-
@ndaidong @extractus/core, @extractus/title-extractor?