copy tsv.gz 3X faster than csv.gz #14973

Closed
BohuTANG opened this issue Mar 16, 2024 · 4 comments · Fixed by #14983

@BohuTANG
Member

BohuTANG commented Mar 16, 2024

Summary

According to @youngsofun, COPY performance for tsv.gz should be the same as for csv.gz.
From the profile, however, the tsv.gz load scanned 15GB in 9 minutes, while the csv.gz load scanned 75GB in 26 minutes.
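
To make the comparison concrete, here is a small sketch of the arithmetic behind the numbers above. It uses only the rounded figures quoted in this summary; the variable and function names are illustrative, and the sketch assumes both files contain the same dataset, so wall-clock time is the fairer comparison.

```python
# Figures quoted in the summary above (rounded as reported).
runs = {
    "tsv.gz": {"bytes_scanned_gb": 15.0, "minutes": 9.0},
    "csv.gz": {"bytes_scanned_gb": 75.0, "minutes": 26.0},
}


def reported_throughput_gb_per_min(run: dict) -> float:
    """Throughput implied by the profile's 'bytes scanned' counter."""
    return run["bytes_scanned_gb"] / run["minutes"]


tsv_rate = reported_throughput_gb_per_min(runs["tsv.gz"])
csv_rate = reported_throughput_gb_per_min(runs["csv.gz"])

# How much longer the CSV load took in wall-clock time.
wall_clock_ratio = runs["csv.gz"]["minutes"] / runs["tsv.gz"]["minutes"]

# How much larger the CSV 'bytes scanned' counter is.
bytes_ratio = runs["csv.gz"]["bytes_scanned_gb"] / runs["tsv.gz"]["bytes_scanned_gb"]

print(f"tsv.gz reported throughput: {tsv_rate:.2f} GB/min")
print(f"csv.gz reported throughput: {csv_rate:.2f} GB/min")
print(f"wall-clock ratio (csv/tsv): {wall_clock_ratio:.2f}")
print(f"bytes-scanned ratio (csv/tsv): {bytes_ratio:.2f}")
```

The wall-clock ratio is roughly 2.9, which matches the "3X" in the title. The bytes-scanned counters differ by 5x, so the CSV load looks faster per reported byte even though it finishes later. That suggests the two counters may not be measuring the same thing.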

Env

Databend Cloud: Databend Query v1.2.373-nightly-683a3a22a7(rust-1.77.0-nightly-2024-03-12T08:05:08.707427719Z)
Warehouse Size: small

How to reproduce

Table:

create or replace  table hits
(
    WatchID BIGINT NOT NULL,
    JavaEnable SMALLINT NOT NULL,
    Title TEXT ,
    GoodEvent SMALLINT NOT NULL,
    EventTime TIMESTAMP NOT NULL,
    EventDate Date NOT NULL,
    CounterID INTEGER NOT NULL,
    ClientIP INTEGER NOT NULL,
    RegionID INTEGER NOT NULL,
    UserID BIGINT NOT NULL,
    CounterClass SMALLINT NOT NULL,
    OS SMALLINT NOT NULL,
    UserAgent SMALLINT NOT NULL,
    URL TEXT NULL,
    Referer TEXT  NULL,
    IsRefresh SMALLINT NOT NULL,
    RefererCategoryID SMALLINT NOT NULL,
    RefererRegionID INTEGER NOT NULL,
    URLCategoryID SMALLINT NOT NULL,
    URLRegionID INTEGER NOT NULL,
    ResolutionWidth SMALLINT NOT NULL,
    ResolutionHeight SMALLINT NOT NULL,
    ResolutionDepth SMALLINT NOT NULL,
    FlashMajor SMALLINT NOT NULL,
    FlashMinor SMALLINT NOT NULL,
    FlashMinor2 TEXT  NULL,
    NetMajor SMALLINT NOT NULL,
    NetMinor SMALLINT NOT NULL,
    UserAgentMajor SMALLINT NOT NULL,
    UserAgentMinor VARCHAR(255) NOT NULL,
    CookieEnable SMALLINT NOT NULL,
    JavascriptEnable SMALLINT NOT NULL,
    IsMobile SMALLINT NOT NULL,
    MobilePhone SMALLINT NOT NULL,
    MobilePhoneModel TEXT  NULL,
    Params TEXT  NULL,
    IPNetworkID INTEGER NOT NULL,
    TraficSourceID SMALLINT NOT NULL,
    SearchEngineID SMALLINT NOT NULL,
    SearchPhrase TEXT  NULL,
    AdvEngineID SMALLINT NOT NULL,
    IsArtifical SMALLINT NOT NULL,
    WindowClientWidth SMALLINT NOT NULL,
    WindowClientHeight SMALLINT NOT NULL,
    ClientTimeZone SMALLINT NOT NULL,
    ClientEventTime TIMESTAMP NOT NULL,
    SilverlightVersion1 SMALLINT NOT NULL,
    SilverlightVersion2 SMALLINT NOT NULL,
    SilverlightVersion3 INTEGER NOT NULL,
    SilverlightVersion4 SMALLINT NOT NULL,
    PageCharset TEXT  NULL,
    CodeVersion INTEGER NOT NULL,
    IsLink SMALLINT NOT NULL,
    IsDownload SMALLINT NOT NULL,
    IsNotBounce SMALLINT NOT NULL,
    FUniqID BIGINT NOT NULL,
    OriginalURL TEXT  NULL,
    HID INTEGER NOT NULL,
    IsOldCounter SMALLINT NOT NULL,
    IsEvent SMALLINT NOT NULL,
    IsParameter SMALLINT NOT NULL,
    DontCountHits SMALLINT NOT NULL,
    WithHash SMALLINT NOT NULL,
    HitColor CHAR NOT NULL,
    LocalEventTime TIMESTAMP NOT NULL,
    Age SMALLINT NOT NULL,
    Sex SMALLINT NOT NULL,
    Income SMALLINT NOT NULL,
    Interests SMALLINT NOT NULL,
    Robotness SMALLINT NOT NULL,
    RemoteIP INTEGER NOT NULL,
    WindowName INTEGER NOT NULL,
    OpenerName INTEGER NOT NULL,
    HistoryLength SMALLINT NOT NULL,
    BrowserLanguage TEXT  NULL,
    BrowserCountry TEXT  NULL,
    SocialNetwork TEXT  NULL,
    SocialAction TEXT  NULL,
    HTTPError SMALLINT NOT NULL,
    SendTiming INTEGER NOT NULL,
    DNSTiming INTEGER NOT NULL,
    ConnectTiming INTEGER NOT NULL,
    ResponseStartTiming INTEGER NOT NULL,
    ResponseEndTiming INTEGER NOT NULL,
    FetchTiming INTEGER NOT NULL,
    SocialSourceNetworkID SMALLINT NOT NULL,
    SocialSourcePage TEXT  NULL,
    ParamPrice BIGINT NOT NULL,
    ParamOrderID TEXT  NULL,
    ParamCurrency TEXT  NULL,
    ParamCurrencyID SMALLINT NOT NULL,
    OpenstatServiceName TEXT  NULL,
    OpenstatCampaignID TEXT  NULL,
    OpenstatAdID TEXT  NULL,
    OpenstatSourceID TEXT  NULL,
    UTMSource TEXT  NULL,
    UTMMedium TEXT  NULL,
    UTMCampaign TEXT  NULL,
    UTMContent TEXT  NULL,
    UTMTerm TEXT  NULL,
    FromTag TEXT  NULL,
    HasGCLID SMALLINT NOT NULL,
    RefererHash BIGINT NOT NULL,
    URLHash BIGINT NOT NULL,
    CLID INTEGER NOT NULL
);

1. COPY csv.gz

COPY INTO hits.hits FROM 's3://clickhouse-public-datasets/hits_compatible/hits.csv.gz' FILE_FORMAT = (TYPE = 'CSV',COMPRESSION=AUTO);

[Screenshot: query profile for the csv.gz COPY]

[Screenshot: total time for the csv.gz COPY]

2. COPY tsv.gz

COPY INTO hits.hits FROM 's3://clickhouse-public-datasets/hits_compatible/hits.tsv.gz' FILE_FORMAT = (TYPE = 'TSV',COMPRESSION=AUTO);

[Screenshot: query profile for the tsv.gz COPY]

[Screenshot: total time for the tsv.gz COPY]

@youngsofun
Member

@BohuTANG the reason is that the new CSV implementation uses less IO prefetching. I will improve it soon.

@BohuTANG
Member Author

Another odd thing is the bytes scanned:
tsv: bytes scanned: 15.18GB
csv: bytes scanned: 75.55GB

Is this also related to the IO prefetching?

@youngsofun
Member

youngsofun commented Mar 16, 2024

@BohuTANG

Yes, I've noticed that too. I will check it as well; there may be a bug in progress reporting.

update

The new implementation increments process_values.bytes by the decompressed file size.

update

Fixed in #14981.
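
The effect described in the first update (a progress counter fed with decompressed rather than compressed bytes) can be illustrated with a minimal Python sketch. This is not Databend's code: the sample rows, `byte_counts`, and the chunk size are made up for illustration, and the sketch only shows why the two ways of counting give very different totals for the same gzip input.

```python
import gzip
import io


def byte_counts(raw: bytes) -> tuple[int, int]:
    """Return (compressed_bytes, decompressed_bytes) for a gzip round trip."""
    compressed = gzip.compress(raw)

    # Stream-decompress in chunks, the way a loader might, and count
    # the bytes that come out of the decompressor.
    decompressed_total = 0
    with gzip.GzipFile(fileobj=io.BytesIO(compressed)) as reader:
        while chunk := reader.read(64 * 1024):
            decompressed_total += len(chunk)

    return len(compressed), decompressed_total


# Hypothetical, highly repetitive tab-separated rows (compress well).
rows = b"".join(
    b"%d\thttp://example.com/page\t%d\n" % (i, i % 7) for i in range(100_000)
)

compressed_size, decompressed_size = byte_counts(rows)
print(f"compressed bytes read from storage: {compressed_size}")
print(f"decompressed bytes handed to the parser: {decompressed_size}")
```

A counter incremented at the storage read sees the first number and a counter incremented after decompression sees the second. Two loaders that count at different points will therefore report very different "bytes scanned" for equivalent data.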

@youngsofun
Member

fixed in #15043
