This repository has been archived by the owner on Nov 15, 2024. It is now read-only.
WIP : Jimmyd/impact retour emploi #384
Open
JimmyDore wants to merge 31 commits into master from jimmyd/impact_retour_emploi.
+1,665 −8
Commits (31), all by JimmyDore:
841a83d  WIP Add Joris work about impact retour emploi
c51259f  Update setup.py to create executables in virtualenv for ire scripts
960aebf  wip
389034d  Industrialize daily copy script
a5f6d0f  Fix scripts launcher
5589f81  Add logs informations
7923a64  Add Exception for the daily parser script
f6f36b2  Clean and prepare jobs join & clean activity_logs-dpae for Jenkins
04902f2  Remove debug mode
d4e8756  Add log about size of DPAE file
c333095  wip make report
22aaf5f  Fix (approximately) issues with path
adfbcb1  Fix last problem with path
b3693ce  Add settings file with different paths
17e4c6f  Fix import module charts
0292e93  Add useful libs to install in DockerFile
a503443  Add xvfb to run imgkit from Docker image
bf21e56  Add comments on main script to make charts and excel report
5d439e3  Update name of DPAE file to be used
6ff55df  Add function to parse activity logs for PSE study
571b82f  Update the way to check if a file needs to be used or not
5299955  Add option to join data on SIREN (or SIRET as before)
98a068b  Remove debug mode
cd71c44  Fix import
d2e40a3  Fix check existence of csv generated file
a68b5ed  Fix SIREN issue int/str
5ab18af  Fix types of columns siren/siret
e9c9653  Fix pandas bug
ccf6a21  Try with SIRET to compare data
fc35a7a  Fix path to dpae file
43d3b82  Fix siren bug
The diff below shows the changes from a single commit: "wip".
110 changes: 75 additions & 35 deletions
110
labonneboite/scripts/impact_retour_emploi/daily_json_activity_parser.py
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
```diff
@@ -1,58 +1,98 @@
-import json
 import urllib.parse
 import os
+import json
 import pandas as pd
 from sqlalchemy import create_engine

-engine = create_engine('mysql://labonneboite:%s@127.0.0.1:3306/labonneboite' %urllib.parse.quote_plus('LaB@nneB@ite'))
-engine.connect()
+from labonneboite.importer import util as import_util
+from labonneboite.importer import settings as importer_settings
+
+create_table_query1 = 'CREATE TABLE IF NOT EXISTS `idpe_connect` ( \
+    `idutilisateur_peconnect` text, \
+    `dateheure` text \
+    ) ENGINE=InnoDB DEFAULT CHARSET=utf8;'
+
+create_table_query2 = 'CREATE TABLE IF NOT EXISTS `activity_logs` ( \
+    `dateheure` text,\
+    `nom` text,\
+    `idutilisateur_peconnect` text,\
+    `siret` text\
+    ) ENGINE=InnoDB DEFAULT CHARSET=utf8;'
+
+last_date_query = 'SELECT dateheure \
+    FROM idpe_connect \
+    ORDER BY dateheure DESC \
+    LIMIT 1'
+
+con, cur = import_util.create_cursor()
+cur.execute(create_table_query1)
+cur.execute(create_table_query2)
+cur.execute(last_date_query)
+row = cur.fetch()
+cur.close()
+con.close()

-data=[]
+data = []

-json_path='/mnt/datalakepe/vers_datalake/activity/'
-#json_path='/home/ads/Documents/labonneboite/labonneboite/importer/jobs/scripts_tre/'
-liste_files=os.listdir(json_path)
-liste_files.sort()
-json_files=liste_files[-1]
+# FIXME : Later, we'll be able to get datas, directly from PE datalake
+# Now we have a cron task which will cpy json activity logs to /srv/lbb/data
+json_logs_folder_path = importer_settings.INPUT_SOURCE_FOLDER
+json_logs_paths = os.listdir(json_logs_folder_path)
+json_logs_paths = [i for i in json_logs_paths if i.startswith('activity')]

-with open(json_path+json_files,'r') as json_files:
-    for line in json_files:
-        data.append(line)
-activity_dico={}
-i=1
-for line in data:
-    activity_dico[str(i)]=json.loads(line)
-    i+=1
+for json_logs_path in json_logs_paths:
+
+    with open(json_path+last_json, 'r') as json_file:
+        for line in json_file:
+            data.append(line)
+activities = {}
+i = 1
+for activity in data:
+    activities[str(i)] = json.loads(activity)
+    i += 1

-table_activity=pd.DataFrame.from_dict(activity_dico).transpose()
+activity_df = pd.DataFrame.from_dict(activities).transpose()

 def idpe_only(row):
-    if row['idutilisateur-peconnect']==None:
+    if row['idutilisateur-peconnect'] is None:
         return 0
-    else : return 1
+    return 1

-table_activity['provisoire']=table_activity.apply(lambda row:idpe_only(row),axis=1)
-table_activity=table_activity[table_activity.provisoire != 0]
-table_activity_2_bis= table_activity.drop_duplicates(subset=['idutilisateur-peconnect'], keep='first')
+activity_df['tri_idpec'] = activity_df.apply(
+    lambda row: idpe_only(row), axis=1)
+activity_df = activity_df[activity_df.tri_idpec != 0]
+activity_idpec = activity_df.drop_duplicates(
+    subset=['idutilisateur-peconnect'], keep='first')

-table_activity_2_bis=table_activity_2_bis[['dateheure','idutilisateur-peconnect']]
-table_activity_2_bis.to_sql(con=engine, name='idpe_connect', if_exists='append',index=False)
+activity_idpec = activity_idpec[[
+    'dateheure', 'idutilisateur-peconnect']]
+activity_idpec.to_sql(
+    con=engine, name='idpe_connect', if_exists='append', index=False)

-cliks_of_interest=['details','afficher-details','telecharger-pdf','ajout-favori']
+cliks_of_interest = ['details', 'afficher-details',
+                     'telecharger-pdf', 'ajout-favori']

-def tri_nom(row):
+def tri_names(row):
     if row['nom'] in cliks_of_interest:
         return True
-    else :
-        return False
+    return False

-table_activity['to_tej'] = table_activity.apply(lambda row: tri_nom(row), axis=1)
-table_activity_2 = table_activity[table_activity.to_tej == True]
+activity_df['tri_names'] = activity_df.apply(
+    lambda row: tri_names(row), axis=1)
+activity_logs = activity_df[activity_df.tri_names is True]

 def siret(row):
     return row['proprietes']['siret']

-table_activity_2['siret'] = table_activity_2.apply(lambda row: siret(row), axis=1)
-cols_of_interest=["dateheure","nom","idutilisateur-peconnect","siret"]
-table_activity_3=table_activity_2[cols_of_interest]
-table_activity_3.to_sql(con=engine, name='activity_logs', if_exists='append',index=False)
+activity_logs['siret'] = activity_logs.apply(
+    lambda row: siret(row), axis=1)
+cols_of_interest = ["dateheure", "nom", "idutilisateur-peconnect", "siret"]
+act_logs_good = activity_logs[cols_of_interest]
+act_logs_good.to_sql(con=engine, name='activity_logs',
+                     if_exists='append', index=False)
```
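As committed, this WIP revision appears not to run end to end: the read loop opens `json_path + last_json`, but `json_path` is deleted in this very diff and `last_json` is never defined; `row = cur.fetch()` is not a DB-API cursor method (`fetchone()` is); `engine` is still passed to `to_sql()` even though its `create_engine(...)` line is removed; and `activity_df[activity_df.tri_names is True]` tests identity against a pandas Series, which is always False. A minimal corrected sketch of the parsing and filtering steps, assuming the files under `importer_settings.INPUT_SOURCE_FOLDER` are the intended input and keeping the diff's variable names where possible:

```python
# A hypothetical corrected sketch, not the committed code. It assumes the
# activity* JSON files copied under importer_settings.INPUT_SOURCE_FOLDER
# are the intended input.
import json
import os

import pandas as pd

from labonneboite.importer import settings as importer_settings

json_logs_folder_path = importer_settings.INPUT_SOURCE_FOLDER
json_logs_paths = [f for f in os.listdir(json_logs_folder_path)
                   if f.startswith('activity')]

data = []
for json_logs_path in json_logs_paths:
    # os.path.join(...) replaces the undefined json_path + last_json.
    with open(os.path.join(json_logs_folder_path, json_logs_path), 'r') as json_file:
        for line in json_file:
            data.append(line)

# enumerate() replaces the hand-rolled counter i.
activities = {str(i): json.loads(line) for i, line in enumerate(data, 1)}
activity_df = pd.DataFrame.from_dict(activities).transpose()

# A boolean mask replaces the 0/1 tri_idpec helper column.
activity_df = activity_df[activity_df['idutilisateur-peconnect'].notnull()]
activity_idpec = activity_df.drop_duplicates(
    subset=['idutilisateur-peconnect'], keep='first')

cliks_of_interest = ['details', 'afficher-details',
                     'telecharger-pdf', 'ajout-favori']
# .isin() replaces the tri_names column and the `... is True` identity
# test, which is always False for a pandas Series.
activity_logs = activity_df[activity_df['nom'].isin(cliks_of_interest)]
```

Filtering with boolean masks and `.isin()` also avoids the helper columns (`tri_idpec`, `tri_names`) that would otherwise need to be dropped before writing to SQL.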
Review comment: add a line break before this line, since pandas is an external lib.
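The requested fix is the standard PEP 8 import grouping: standard-library modules first, then a blank line, then third-party packages such as pandas, then local application imports. A sketch of that grouping, using only the imports already present in this diff:

```python
# Standard library
import json
import os
import urllib.parse

# Third-party: blank line above, because pandas and SQLAlchemy are external libs
import pandas as pd
from sqlalchemy import create_engine

# Local application
from labonneboite.importer import util as import_util
from labonneboite.importer import settings as importer_settings
```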