Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add sanitization and validation for files #3

Open
wants to merge 5 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 1 addition & 0 deletions Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,7 @@ url = "2.5.4"
base32 = "0.5.1"
blake3 = "1.5.4"
utoipa = { version = "5.2.0", optional = true }
mime = "0.3"

[dev-dependencies]
pubky = "0.3.0"
Expand Down
16 changes: 8 additions & 8 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -95,17 +95,17 @@ Pubky.app models are designed for decentralized content sharing. The system uses

### PubkyAppFile

**Description:** Represents metadata of file uploaded by the user.
**Description:** Represents a file uploaded by the user, containing its metadata, including a reference to the actual blob of the file in `src` property.

**URI:** `/pub/pubky.app/files/:file_id`

| **Field** | **Type** | **Description** | **Validation Rules** |
| -------------- | -------- | --------------------------- | --------------------------- |
| `name` | String | Name of the file. | Required. |
| `created_at` | Integer | Unix timestamp of creation. | Required. |
| `src` | String | File blob URL | Required. |
| `content_type` | String | MIME type of the file. | Required. |
| `size` | Integer | Size of the file in bytes. | Required. Positive integer. |
| **Field** | **Type** | **Description** | **Validation Rules** |
| -------------- | -------- | --------------------------- | ---------------------------------------------- |
| `name` | String | Name of the file. | Required. Must be 1-255 characters |
| `created_at` | Integer | Unix timestamp of creation. | Required. |
| `src` | String | File blob URL | Required. must be a valid URL. Max length 1024 |
| `content_type` | String | MIME type of the file. | Required. Valid IANA mime types |
| `size` | Integer | Size of the file in bytes. | Required. Positive integer. Max size is 10Mb |

**Validation Notes:**

Expand Down
144 changes: 136 additions & 8 deletions src/file.rs
Original file line number Diff line number Diff line change
@@ -1,13 +1,46 @@
use std::str::FromStr;

use crate::{
common::timestamp,
traits::{HasPath, TimestampId, Validatable},
APP_PATH,
};
use mime::Mime;
use serde::{Deserialize, Serialize};

use url::Url;
#[cfg(feature = "openapi")]
use utoipa::ToSchema;

const MIN_NAME_LENGTH: usize = 1;
const MAX_NAME_LENGTH: usize = 255;
const MAX_SRC_LENGTH: usize = 1024;
const MAX_SIZE: i64 = 10 * (1 << 20); // 10 MB

const VALID_MIME_TYPES: &[&str] = &[
"application/javascript",
"application/json",
"application/octet-stream",
"application/pdf",
"application/x-www-form-urlencoded",
"application/xml",
"application/zip",
"audio/mpeg",
"audio/wav",
"image/gif",
"image/jpeg",
"image/png",
"image/svg+xml",
"image/webp",
"multipart/form-data",
"text/css",
"text/html",
"text/plain",
"text/xml",
"video/mp4",
"video/mpeg",
];

/// Represents a file uploaded by the user.
/// URI: /pub/pubky.app/files/:file_id
#[derive(Deserialize, Serialize, Debug, Default, Clone)]
Expand All @@ -31,6 +64,7 @@ impl PubkyAppFile {
content_type,
size,
}
.sanitize()
}
}

Expand All @@ -43,11 +77,66 @@ impl HasPath for PubkyAppFile {
}

impl Validatable for PubkyAppFile {
// TODO: content_type validation.
fn sanitize(self) -> Self {
let name = self.name.trim().chars().take(MAX_NAME_LENGTH).collect();

let sanitized_src = self
.src
.trim()
.chars()
.take(MAX_SRC_LENGTH)
.collect::<String>();

let src = match Url::parse(&sanitized_src) {
Ok(_) => Some(sanitized_src),
Err(_) => None, // Invalid src URL, set to None
};

let content_type = self.content_type.trim().to_string();

Self {
name,
created_at: self.created_at,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we cannot use hash_id. We have a created_at timestamp anyway, therefore it is duplicated information. It's a straightforward method for the user to post 3 times the same .gif without creating 3 times the file on his homeserver or the indexer.

Alternatively, we have name field, that could as well be the file and blob name instead.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the emphasis on hash_id for? I understand its use case for something like tags, but we don't need that here.
The use case you're mentioning is not solved by having the same hash_id. With the current way, a sane user can create one blob, one file and then use that one file anywhere he wants. Should we stop the ability of someone to create multiple blobs and multiple files? I don't think so. I think it's important to remember we're a proxy here and people can do whatever they want with their homeservers, so while we enable, we shouldn't really look into ways of prohibiting some use cases, no matter how fringe they might sound to us, when there's no problem with having them.

Copy link
Collaborator

@SHAcollision SHAcollision Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, not necessarily saying hash_id should go on the /file object only, I am talking about the obvious big benefits of hash_id for the data blob. Are the /blobs and the way Nexus uses them covered by the specs?

There is no use for the timestamp ids.

Consider what happens when in pubky-app I am a shit poster and I reply to 10 users by dropping into the post editor modal the same animated gif to make fun of them.

people can do whatever they want with their homeservers

This is barely an argument as the specs are created with the explicit intention of restricting the way data is written into homeservers into a common set of rules that allow interoperability between social pubky-app clients and indexers. A user can write whatever data he wants in whatever spec breaking way he wants and simply share a URI.

when there's no problem with having them

I think free storage saving for anyone using pubky-app according to specs is really good for such a small change (just hash_id for blobs). Might seem a stupid optimization this early but changing IDs and schemas post-launch will be harder.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For blob ids it does make sense to have hash_ids.

src: src.unwrap_or("".to_string()),
content_type,
size: self.size,
}
}

fn validate(&self, id: &str) -> Result<(), String> {
self.validate_id(id)?;
// TODO: content_type validation.
// TODO: size and other validation.

// Validate name
let name_length = self.name.chars().count();

if !(MIN_NAME_LENGTH..=MAX_NAME_LENGTH).contains(&name_length) {
return Err("Validation Error: Invalid name length".into());
}

// Validate src
if self.src.chars().count() == 0 {
return Err("Validation Error: Invalid src".into());
}
if self.src.chars().count() > MAX_SRC_LENGTH {
return Err("Validation Error: src exceeds maximum length".into());
}

// validate content type
match Mime::from_str(&self.content_type) {
Ok(mime) => {
if !VALID_MIME_TYPES.contains(&mime.essence_str()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Declared mime type could be pdf and blob be a exe , right? On click on the browser from pubky-app, the user will download a .exe ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. The user doesn't really have an incentive to lie about this.
Ultimately of course, we can have some security checker that runs the files through some malware detector, but mind you, something like this needs to be implemented on the homeserver first. The same goes for content type and sniffing the actual file type of the blob.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The user doesn't really have an incentive to lie about this.

Can you elaborate?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm saying for normal users there's no need to lie about the content type of their files.
For malicious intent, you don't have to lie about the content type. IMO, If there are to be checks for this, it should be on the homeserver, not here.

return Err("Validation Error: Invalid content type".into());
}
}
Err(_) => {
return Err("Validation Error: Invalid content type".into());
}
}

// Validate size
if self.size <= 0 || self.size > MAX_SIZE {
Copy link
Collaborator

@SHAcollision SHAcollision Dec 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we taking the client declared size in /pubky.app/file/<object> at face value? Is there any way we can actually check the size of the blob? Maybe we should have a blob model with the validation that should be passed after we get the blob response in ingest() and before store_blob() in events/handlers/file.rs

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the user has any incentive to lie about the size of the file.
Having said that, it looks close to the content-type issue. Our best check would be to make sure it matches the content-length header returned from the homeserver.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this something we can write into this specs crate? Should be as part of the /blob validation that probably has to run as a child validation of the /file one ?

return Err("Validation Error: Invalid size".into());
}
Ok(())
}
}
Expand Down Expand Up @@ -97,7 +186,7 @@ mod tests {
fn test_validate_valid() {
let file = PubkyAppFile::new(
"example.png".to_string(),
"/uploads/example.png".to_string(),
"pubky://user_id/pub/pubky.app/blobs/id".to_string(),
"image/png".to_string(),
1024,
);
Expand All @@ -110,7 +199,7 @@ mod tests {
fn test_validate_invalid_id() {
let file = PubkyAppFile::new(
"example.png".to_string(),
"/uploads/example.png".to_string(),
"pubky://user_id/pub/pubky.app/blobs/id".to_string(),
"image/png".to_string(),
1024,
);
Expand All @@ -119,21 +208,60 @@ mod tests {
assert!(result.is_err());
}

#[test]
fn test_validate_invalid_content_type() {
let file = PubkyAppFile::new(
"example.png".to_string(),
"pubky://user_id/pub/pubky.app/blobs/id".to_string(),
"notavalid/content_type".to_string(),
1024,
);
let id = file.create_id();
let result = file.validate(&id);
assert!(result.is_err());
}

#[test]
fn test_validate_invalid_size() {
let file = PubkyAppFile::new(
"example.png".to_string(),
"pubky://user_id/pub/pubky.app/blobs/id".to_string(),
"notavalid/content_type".to_string(),
MAX_SIZE + 1,
);
let id = file.create_id();
let result = file.validate(&id);
assert!(result.is_err());
}

#[test]
fn test_validate_invalid_src() {
let file = PubkyAppFile::new(
"example.png".to_string(),
"not_a_url".to_string(),
"notavalid/content_type".to_string(),
MAX_SIZE + 1,
);
let id = file.create_id();
let result = file.validate(&id);
assert!(result.is_err());
}

#[test]
fn test_try_from_valid() {
let file_json = r#"
{
"name": "example.png",
"created_at": 1627849723,
"src": "/uploads/example.png",
"src": "pubky://user_id/pub/pubky.app/blobs/id",
"content_type": "image/png",
"size": 1024
}
"#;

let file = PubkyAppFile::new(
"example.png".to_string(),
"/uploads/example.png".to_string(),
"pubky://user_id/pub/pubky.app/blobs/id".to_string(),
"image/png".to_string(),
1024,
);
Expand All @@ -143,7 +271,7 @@ mod tests {
let file_parsed = <PubkyAppFile as Validatable>::try_from(&blob, &id).unwrap();

assert_eq!(file_parsed.name, "example.png");
assert_eq!(file_parsed.src, "/uploads/example.png");
assert_eq!(file_parsed.src, "pubky://user_id/pub/pubky.app/blobs/id");
assert_eq!(file_parsed.content_type, "image/png");
assert_eq!(file_parsed.size, 1024);
}
Expand Down
55 changes: 55 additions & 0 deletions src/file_blob.rs
Original file line number Diff line number Diff line change
@@ -0,0 +1,55 @@
use crate::{
traits::{HasPath, HashId},
APP_PATH,
};

use serde::{Deserialize, Serialize};

#[cfg(feature = "openapi")]
use utoipa::ToSchema;

const SAMPLE_SIZE: usize = 2 * 1024;

/// Represents a file uploaded by the user.
/// URI: /pub/pubky.app/files/:file_id
#[derive(Deserialize, Serialize, Debug, Default, Clone)]
#[cfg_attr(feature = "openapi", derive(ToSchema))]
pub struct PubkyAppBlob(pub Vec<u8>);

impl HashId for PubkyAppBlob {
fn get_id_data(&self) -> String {
// Get the start and end samples
let start = &self.0[..SAMPLE_SIZE.min(self.0.len())];
let end = if self.0.len() > SAMPLE_SIZE {
&self.0[self.0.len() - SAMPLE_SIZE..]
} else {
&[]
};

// Combine the samples
let mut combined = Vec::with_capacity(start.len() + end.len());
combined.extend_from_slice(start);
combined.extend_from_slice(end);

base32::encode(base32::Alphabet::Crockford, &combined)
}
}

impl HasPath for PubkyAppBlob {
fn create_path(&self) -> String {
format!("{}blobs/{}", APP_PATH, self.create_id())
}
}

#[cfg(test)]
mod tests {
use super::*;
use crate::traits::HashId;

#[test]
fn test_get_id_data_size_is_smaller_than_sample() {
let blob = PubkyAppBlob(vec![1, 2, 3, 4, 5, 6, 7, 8, 9, 10]);
let id = blob.get_id_data();
assert_eq!(id, "041061050R3GG28A");
}
}
1 change: 1 addition & 0 deletions src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@ mod bookmark;
mod common;
mod feed;
mod file;
mod file_blob;
mod follow;
mod last_read;
mod mute;
Expand Down
Loading