Mc agency homepage searcher #74

Draft
wants to merge 74 commits into base: main
Changes from 70 commits

Commits (74)
a4003d9
Create module and db manager
maxachis Mar 20, 2024
fc02c71
Create util module and place db_manager inside.
maxachis Mar 20, 2024
39a6e87
Add google-api-python-client to requirements.txt
maxachis Mar 20, 2024
df92821
Update execute type hinting
maxachis Mar 20, 2024
7a643e4
Create AgencyInfo dataclass
maxachis Mar 20, 2024
e1bc20b
Create GoogleSearcher class
maxachis Mar 20, 2024
539a292
Create HuggingFaceAPIManager
maxachis Mar 20, 2024
564f829
Create HomepageSearcher and associated dataclasses and support functions
maxachis Mar 20, 2024
60bbea0
Fix bug in upload_file function
maxachis Mar 20, 2024
738268d
Modify search function to return None if Quota exceeded, rather than …
maxachis Mar 20, 2024
7d1bd39
Move environmental retrieval functionality from outside of GoogleSear…
maxachis Mar 20, 2024
929fc90
Add type hinting to search_and_upload
maxachis Mar 21, 2024
76cd17f
Move huggingface_api_manager.py to util folder
maxachis Mar 21, 2024
de67ba7
Update upload_file type hinting
maxachis Mar 21, 2024
1cec6c5
Add docstrings
maxachis Mar 21, 2024
c02a84d
Move get_filename_friendly_timestamp to miscellaneous_functions.py
maxachis Mar 21, 2024
ef48e49
Add executemany method
maxachis Mar 25, 2024
047925c
Add docstrings to some methods
maxachis Mar 25, 2024
a2fc866
Add search cache logic
maxachis Mar 25, 2024
dc0c66d
Remove local import of load_dotenv
maxachis Mar 25, 2024
ac72bc7
Update requirements.txt
maxachis Mar 25, 2024
ed0f6ce
Modify executemany
maxachis Mar 25, 2024
a40cd38
Modify SQL statmeents
maxachis Mar 25, 2024
4827f8c
Fix bug in write_to_temporary_csv
maxachis Mar 25, 2024
10ea361
Add clarifying print statements
maxachis Mar 25, 2024
f70ba3f
Set default max searches in main script
maxachis Mar 25, 2024
7b84daf
Add logic for handling search errors in the middle of a search
maxachis Mar 25, 2024
a67d171
Refined logic for catching when quota exceeded.
maxachis Mar 25, 2024
aca3041
Merge remote-tracking branch 'origin/main' into mc_agency_homepage_se…
maxachis Mar 25, 2024
57bbf71
Add utf-8 encoding and handle exceptions in CSV writing
maxachis Mar 31, 2024
dac99e0
Extract main function from homepage_searcher to main.py
maxachis Mar 31, 2024
ae73e5a
Add string generation method in agency_info.py
maxachis Mar 31, 2024
7b20b3e
Refactor GoogleSearcher class and error handling
maxachis Mar 31, 2024
c372d66
Refactor homepage_searcher.py and improve error handling
maxachis Mar 31, 2024
4e1832f
Update error message in google_searcher.py
maxachis Mar 31, 2024
540701e
Add unit tests for GoogleSearcher in agency_homepage_searcher module
maxachis Mar 31, 2024
e2fe90d
Update print statement after search completion
maxachis Mar 31, 2024
f7a7665
Refactor exception handling, search process and added upload to Huggi…
maxachis Apr 1, 2024
42d0821
Expand unit tests for `homepage_searcher` and `google_searcher`
maxachis Apr 1, 2024
ec90bfa
Refined error message in homepage_searcher module
maxachis Apr 1, 2024
f41573f
Simplified test_agency_homepage_searcher_unit.py imports
maxachis Apr 1, 2024
a72304b
Add test_agency_homepage_searcher_integration.py
maxachis Apr 1, 2024
3d629dc
Fix import issues in agency_homepage_searcher
maxachis Apr 1, 2024
585f79a
Refine search string generation in agency_info.py
maxachis Apr 2, 2024
d9b3758
Update pytest module and psycopg dependency in requirements.txt
maxachis Apr 2, 2024
97e859f
Add SearchResultEnum and update SQL query in homepage_searcher.py
maxachis Apr 2, 2024
96ed368
Replace psycopg2 with psycopg in db_manager.py
maxachis Apr 2, 2024
88f08d3
Add requirements for agency_homepage_searcher action
maxachis Apr 2, 2024
5cf2b86
Add pytest_postgresql integration
maxachis Apr 2, 2024
71966ac
Refactor unit test and add test for character stripping
maxachis Apr 2, 2024
947662f
Add README.md file for Agency Homepage Searcher module
maxachis Apr 2, 2024
8250d69
Create blank agency_homepage_searcher.yaml.
maxachis Apr 2, 2024
ff1165b
Improve search status handling and cache updating
maxachis Apr 10, 2024
9390aff
Refactor agency validation in search cache update
maxachis Apr 10, 2024
ad0a5ab
Add SearchResultEnum to SearchResults in unit tests
maxachis Apr 10, 2024
4ac8e4f
Handle case when no 'items' in Google search result
maxachis Apr 10, 2024
69855a3
Improve error handling in homepage search results
maxachis Apr 10, 2024
847437d
Refactor homepage searcher tests and cache update
maxachis Apr 10, 2024
2b3508e
Merge branch 'main' into mc_agency_homepage_searcher
maxachis Apr 10, 2024
014bdc5
Update package requirement psycopg2-binary to psycopg[binary]
maxachis Apr 10, 2024
3b511d4
Add daily run GitHub action for agency homepage searcher
maxachis Apr 10, 2024
5ccf4c4
Update psycopg[binary] version in requirements
maxachis Apr 10, 2024
a140e56
Revise workflow file for agency_homepage_searcher action
maxachis Apr 10, 2024
7c851a5
Remove redundant huggingface-hub package.
maxachis Apr 12, 2024
b2346be
Refactor state name lookup using USStateReference class
maxachis Apr 12, 2024
5668c8a
Remove pytest-postgresql dependency from requirements
maxachis Apr 12, 2024
b63abc0
Update instructions for API key and ID in README.md
maxachis Apr 12, 2024
c0de6fe
Update huggingface-hub version in requirements.
maxachis Apr 12, 2024
d4185ff
Refactor variable assignment in row mapping.
maxachis Apr 12, 2024
57465b0
Update test methods in HomepageSearcher test.
maxachis Apr 12, 2024
7f7ce7d
Remove unused imports in test_agency_homepage_searcher_integration
maxachis Apr 16, 2024
dcfbfbc
Refactor DBManager to use database URL
maxachis Apr 16, 2024
e846650
Update README.md with script and testing instructions
maxachis Apr 17, 2024
5213701
Update search cache retrieval and update methods to use search cache …
maxachis Jun 15, 2024
30 changes: 30 additions & 0 deletions .github/actions/agency_homepage_searcher.yaml
@@ -0,0 +1,30 @@
name: Run Script Daily
on:
  schedule:
    - cron: '0 0 * * *' # Run daily at 00:00

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.11.3'
      - name: Install Dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r agency_homepage_searcher/requirements_agency_homepage_searcher_action.txt
      - name: Run Agency Homepage Searcher
        run: |
          python agency_homepage_searcher/main.py
        env:
          CUSTOM_SEARCH_API_KEY: ${{ secrets.CUSTOM_SEARCH_API_KEY }}
          CUSTOM_SEARCH_ENGINE_ID: ${{ secrets.CUSTOM_SEARCH_ENGINE_ID }}
          DIGITAL_OCEAN_DB_USERNAME: ${{ secrets.DIGITAL_OCEAN_DB_USERNAME }}
          DIGITAL_OCEAN_DB_PASSWORD: ${{ secrets.DIGITAL_OCEAN_DB_PASSWORD }}
          DIGITAL_OCEAN_DB_HOST: ${{ secrets.DIGITAL_OCEAN_DB_HOST }}
          DIGITAL_OCEAN_DB_PORT: ${{ secrets.DIGITAL_OCEAN_DB_PORT }}
          DIGITAL_OCEAN_DB_NAME: ${{ secrets.DIGITAL_OCEAN_DB_NAME }}
          HUGGINGFACE_ACCESS_TOKEN: ${{ secrets.HUGGINGFACE_ACCESS_TOKEN }}
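
Note on the env block above: the workflow passes the database connection as five separate DIGITAL_OCEAN_DB_* secrets. As a rough sketch only, this is one way a script could combine them into a single PostgreSQL URL. The helper name `build_database_url` is hypothetical and does not appear in this PR, and how `main.py` or `DBManager` actually consumes these variables is not shown in this diff.

```python
import os
from urllib.parse import quote


def build_database_url(env=None) -> str:
    """Hypothetical helper: assemble a PostgreSQL URL from DIGITAL_OCEAN_DB_* variables."""
    # Fall back to the process environment when no mapping is supplied
    env = os.environ if env is None else env
    # Percent-encode credentials so characters such as '@' or ':' do not break the URL
    user = quote(env["DIGITAL_OCEAN_DB_USERNAME"], safe="")
    password = quote(env["DIGITAL_OCEAN_DB_PASSWORD"], safe="")
    host = env["DIGITAL_OCEAN_DB_HOST"]
    port = env["DIGITAL_OCEAN_DB_PORT"]
    name = env["DIGITAL_OCEAN_DB_NAME"]
    return f"postgresql://{user}:{password}@{host}:{port}/{name}"
```

Accepting the mapping as a parameter keeps the helper testable without touching real secrets; a missing variable raises `KeyError` early instead of producing a malformed URL.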
144 changes: 144 additions & 0 deletions Tests/test_agency_homepage_searcher_integration.py
@@ -0,0 +1,144 @@
import csv
from typing import List
from unittest.mock import MagicMock

import pytest
from pytest_postgresql import factories

from agency_homepage_searcher.agency_info import AgencyInfo
from agency_homepage_searcher.google_searcher import GoogleSearcher
from agency_homepage_searcher.homepage_searcher import HomepageSearcher, SearchResults
from util.db_manager import DBManager
from util.huggingface_api_manager import HuggingFaceAPIManager

FAKE_SEARCH_ROW_COUNT = 10


@pytest.fixture
def google_searcher(mocker):
    api_key = "test_api_key"
    cse_id = "test_cse_id"
    mock_service = mocker.patch("agency_homepage_searcher.google_searcher.build")

    # Create a mock for the Google API service object and set it as the return_value for the 'build' method
    mock_google_api_service = mocker.Mock()
    mock_service.return_value = mock_google_api_service
    return GoogleSearcher(api_key, cse_id)


def get_fake_agency_info() -> AgencyInfo:
    """
    Build a fake AgencyInfo for use in tests.

    Returns:
        AgencyInfo populated with placeholder values.
    """
    return AgencyInfo(
        agency_name="Agency Police Agency",
        city="Cityopolis",
        state="PA",  # Must be an actual state because it is put in the STATE_ISO_TO_NAME_DICT in homepage_searcher.py
        county="Horborgor",
        zip_code="31415",
        website=None,
        agency_type="Police Agency",
        agency_id="abcdefghijklmnop"
    )


def convert_agency_info_to_list(agency_info: AgencyInfo) -> list:
    return [
        agency_info.agency_name,  # 0
        agency_info.agency_type,  # 1
        agency_info.state,  # 2
        agency_info.city,  # 3
        agency_info.county,  # 4
        agency_info.agency_id,  # 5
        agency_info.website,  # 6
        agency_info.zip_code  # 7
    ]


def validate_search_query(query_string):
    agency_info_list = convert_agency_info_to_list(get_fake_agency_info())
    for item in agency_info_list:
        if item is None:
            continue
        assert item in query_string, f"Item {item} not found in query string {query_string}"


def validate_update_search_cache(search_results: list[SearchResults]):
    agency_id = get_fake_agency_info().agency_id
    assert len(search_results) == 1
    assert agency_id == search_results[0].agency_id, f"Agency ID {agency_id} not in expected argument ({search_results[0].agency_id})"


def mock_database_query(query_string):
    return convert_agency_info_to_list(get_fake_agency_info())


def mock_search(q, cx):
    # Validate query is correct
    validate_search_query(q)

    # Return fake data
    return get_fake_search_data()


def get_fake_search_data():
    """
    Generate fake search data.

    Returns:
        dict with an 'items' list of FAKE_SEARCH_ROW_COUNT entries,
        each containing a 'link' and a 'snippet'.
    """
    fake_search_data = {'items': []}
    for i in range(1, FAKE_SEARCH_ROW_COUNT + 1):
        number = i
        # ASCII value of 'a' is 97, so we add i - 1 to it to get the incremental letter
        letter = chr(97 + (i - 1) % 26)  # Use modulo 26 to loop back to 'a' after 'z'
        fake_search_data['items'].append(
            {
                'link': f'https://www.example.com/{number}',
                'snippet': f'This snippet contains the letter {letter}'
            }
        )
    return fake_search_data


def validate_upload_to_huggingface(search_results: List[SearchResults]) -> None:
    fake_search_data_list = get_fake_search_data()['items']
    fake_agency_id = get_fake_agency_info().agency_id

    # Check there is only one search result
    assert len(search_results) == 1, "There should be only one search result passed to upload_to_huggingface"
    search_result = search_results[0]
    assert search_result.agency_id == fake_agency_id, f"Search result agency id should match {fake_agency_id}, is {search_result.agency_id}"
    assert len(
        search_result.search_results) == FAKE_SEARCH_ROW_COUNT, f"Number of search results should be {FAKE_SEARCH_ROW_COUNT}, is {len(search_result.search_results)}"
    for i in range(FAKE_SEARCH_ROW_COUNT):
        fake_search_data = fake_search_data_list[i]
        possible_homepage_url = search_result.search_results[i]
        assert fake_search_data[
                   'link'] == possible_homepage_url.url, f"Search result link {fake_search_data['link']} should match {possible_homepage_url.url}"
        assert fake_search_data[
                   'snippet'] == possible_homepage_url.snippet, f"Search result snippet {fake_search_data['snippet']} should match {possible_homepage_url.snippet}"


def test_agency_homepage_searcher_integration(monkeypatch, google_searcher):
    # Patch Google Searcher so that search call returns fake data
    google_searcher.service.cse().list().execute.return_value = get_fake_search_data()

    homepage_searcher = HomepageSearcher(
        search_engine=google_searcher,
        database_manager=MagicMock(spec=DBManager),
        huggingface_api_manager=MagicMock(spec=HuggingFaceAPIManager)
    )

    # Mock methods in homepage searcher that interface with external sources
    # update_search_cache - verifies proper IDs
    # get_agencies_without_homepage_urls - return list of fake agency info
    # upload_to_huggingface - verifies proper search results
    homepage_searcher.update_search_cache = validate_update_search_cache
    homepage_searcher.get_agencies_without_homepage_urls = lambda: [get_fake_agency_info()]
    homepage_searcher.upload_to_huggingface = validate_upload_to_huggingface

    homepage_searcher.search_and_upload(1)
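
Note on the test above: it stubs the Google API client by assigning to `service.cse().list().execute.return_value`. The standalone sketch below illustrates why that chained assignment works with `unittest.mock.MagicMock`. The `search` function here is a hypothetical stand-in written for this illustration, not the PR's `GoogleSearcher.search`.

```python
from unittest.mock import MagicMock

# A MagicMock returns the same child mock for every call to a given attribute,
# regardless of the arguments, so the chain configured here is the same chain
# the code under test reaches with its own arguments.
service = MagicMock()
service.cse().list().execute.return_value = {
    "items": [{"link": "https://www.example.com/1", "snippet": "first"}]
}


def search(service, query: str, cse_id: str) -> list:
    """Hypothetical stand-in for a searcher method that calls the API client."""
    response = service.cse().list(q=query, cx=cse_id).execute()
    # Tolerate responses that carry no 'items' key
    return response.get("items", [])


results = search(service, "Agency Police Agency", "test_cse_id")
```

Because the stub ignores call arguments, a check that the query was built correctly has to be made separately, for example with `service.cse().list.assert_called_with(q=..., cx=...)`.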
