-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Return list of expanded URLs instead of just the string of only the first URL #14
Comments
Agreed, this is the right thing. In the current version of the api spec, the urls are a list. The sample data generator doesn't insert multiple urls yet though, so I'll leave this issue open until it does. |
@raindrift, I also noticed two additional/related issues in
text = "https://www.facebook.com www.google.com github.com" # 3 URLS
# current
>>> re.findall(r"https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+", text)
['https://www.facebook.com'] # misses 2 of the 3 URLs
# more robust solution: https://urlextract.readthedocs.io/en/latest/urlextract.html
from urlextract import URLExtract
extractor = URLExtract()
>>> extractor.find_urls(text)
['https://www.facebook.com', 'www.google.com', 'github.com'] # gets all 3 URLS
ranking-challenge/sample_data/data_pull.py Lines 249 to 253 in ae88eb0
embedded_urls = []
urls = row.entities.get("urls", [])
for url in urls:
expanded_url = url.get("expanded_url")
if expanded_url:
embedded_urls.append(expanded_url) |
Hi, I noticed in the line below the code only returns the first expanded URL (
urls[0]
), and returns it as a string. Often, there are multiple URLs, so it would be great if all expanded URLs in any given tweet were returned. See the suggestion below. Thanks!ranking-challenge/sample_data/preprocessing.py
Line 186 in 28b900e
The text was updated successfully, but these errors were encountered: