Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to load generated_schemas - appear as blank.txt #28

Open
rabsef-bicrym opened this issue Jan 6, 2020 · 20 comments
Open

How to load generated_schemas - appear as blank.txt #28

rabsef-bicrym opened this issue Jan 6, 2020 · 20 comments

Comments

@rabsef-bicrym
Copy link

Load the metadata with from source csv files in generated_schemas with the management command: python manage.py load_metadata

I seem to only have blank.txt in generated_schemas right now. What am I missing in the setup steps to get this properly filled w/ csv files for the step above?

I'm re-following through the guide as my last attempt was unsuccessful. Gotten this far but I haven't seemed to load anything to generated schemas.

@bcipolli
Copy link

bcipolli commented Jan 7, 2020

Having the same issue. I'm hacking things together, as:

  • generated_schemas/emptyhead_blacklist.txt doesn't exist anywhere
  • All csv files exist within irsx, but the METADATA_DIRECTORY has been changed in load_metadata.py without specification on why.

@bcipolli
Copy link

bcipolli commented Jan 7, 2020

Offending commit is here: 0d0ccc9

@bcipolli
Copy link

bcipolli commented Jan 7, 2020

For better or for worse, my workaround is:

  • Switch back to the old METADATA_DIRECTORY
  • Create an empty emptyhead_blacklist.txt in that directory.

Once I do that, I'm so far able to follow setup steps. I'm currently loading 2017 filings into a Postgres database.

@bcipolli
Copy link

bcipolli commented Jan 7, 2020

@jsfenfen Looks like you checked in code for a version of IRSX that hasn't been released. Can you push that code / make that release? Or, can I create a branch here that works around the issue here?

See: https://github.com/jsfenfen/990-xml-reader/releases - there is no 0.2.8 release, as referenced in the commit I linked above.

@rabsef-bicrym
Copy link
Author

@bcipolli I think that's worked for me, mate. Thank you. Incidentally, have you run year 2018, returns from 2017 and gotten a lot of this:

keyerror: [{'schedule_name': 'IRS990', 'group_keyerrors': [], 'keyerrors': [{'element_path': '/IRS990/ContractorCompensationGrp/ContractorAddress'}]}]

@bcipolli
Copy link

@rabsef-bicrym I didn't, but I assume this is the kind of path blacklisting that the system (which was implemented, but not merged) was meant to take care of. I believe adding /IRS990/ContractorCompensationGrp/ContractorAddress to emptyhead_blacklist.txt may avoid these errors, if needed. But, just a hunch at this point.

@rabsef-bicrym
Copy link
Author

Thank you - I really appreciate your insight and assistance.

@bcipolli
Copy link

No problem. I'm walking through the same issues right now as well. :)

@bcipolli
Copy link

This is my final result:
image

This "missing" forms are ones that couldn't be parsed. I'm assuming that's due to the hackiness of the above approach, rather than getting the system to skip blacklisted xpaths. Hopefully @jsfenfen can upload the irsx updates, eliminate the need for this workaround, and get these errors to zero!

@rabsef-bicrym
Copy link
Author

I'm not sure exactly how to check like you have above w/ the num_missing, but I got most of 2018 submission year successfully and I am STOKED. Thanks @bcipolli and @jsfenfen for everything.

@jsfenfen
Copy link
Owner

Hey, apologies for the delay. I should probably rewrite the documentation for that part. Will try to take a look soon, I believe you don't have to run this step at all, I may be misunderstanding.

To @bcipolli's point, I believe the files that are not being entered are in unrecognized versions, generally this means schemas from 2013 or before, but if you do any digging and find that not to be the case I'd love to know what's causing it.

The "keyerrors" generally represent incorrect xpaths, @bcipolli is on target that there was a list of emptyheads that get blacklisted, though their existence is an artifact of some less-than-perfect code upstream.

@bcipolli: I can push missing versions, there was one version I suppressed b/c it had a bug, I think.

@bcipolli
Copy link

bcipolli commented Jan 16, 2020

Thanks @jsfenfen !

@rabsef-bicrym - here's the query I ran to get that result:

with data as (
select 
    filing.submission_year,
    --filing.return_type,
    filing.parse_complete,
	count(*) as num_filings,
	count(f1040.object_id) as num_1040,
	count(ez.object_id) as num_ez,
	count(pf.object_id) as num_pf
from public.filing_filing  filing 
left join return_part_0 f1040
	on f1040.object_id = filing.object_id
left join return_ez_part_0 ez 
	on ez.object_id = filing.object_id
left join return_pf_part_0 pf 
	on pf.object_id = filing.object_id
group by 
	filing.submission_year,
	--filing.return_type,
	filing.parse_complete
)
select 
	*, 
	num_filings - num_1040 - num_ez - num_pf as num_missing
from data 
where 
	num_filings != num_1040+num_ez+num_pf

@Georg-coder
Copy link

Georg-coder commented Jan 30, 2020

I have the same problem. There is only an empty blank.txt file in my generated schemas.
How did you guys solve it? I havent understood where and how you changed code.
@bcipoli:
How do I change back to the old METADATA_DIRECTORY?
Secondly, I just create an empty text file - called emptyhead_blacklist.txt - in the file browser under the folder "METADATA_DIRECTORY"? --> I cannot find such a folder..

@rabsef-bicrym
Copy link
Author

rabsef-bicrym commented Jan 30, 2020

@Georg-coder The way I figured this out was by looking @ what @bcipolli said was the offending commit.

METADATA_DIRECTORY = settings.METADATA_DIRECTORY
#METADATA_DIRECTORY = settings.METADATA_DIRECTORY
METADATA_DIRECTORY = settings.GENERATED_MODELS_DIR

You want to change the third line to the first line - the change was the first line to the third line, so setting it back is "resetting the METADATA_DIRECTORY to the old value"

The directory into which you want to create the empty text file is 990*\irsdb\metadata

Let us know if that helps!

@Georg-coder
Copy link

Georg-coder commented Jan 30, 2020

@rabsef-bicrym Thanks, it worked! Now Im doing:
nohup python manage.py load_filings 2015 &

so it can run without a timeout in the back as described in the read.me.
How do I know when it is done and where do I eventually find/see the downloaded data?
--> I have connected a postgresql database to PyCharm (I called it "django"), which opens in my chrome browser and shows some graphs – but I cannot see anywhere the actual data (screenshot attached).
Screenshot 2020-01-30 at 10 08 59 AM

@rabsef-bicrym
Copy link
Author

@Georg-coder It's going to be hard for me to instruct on that part of things. I personally am pointing a BI tool called Tableau at the database to visualize the data in the ways that I want to. I also ran all of my loads in a screen and not using nohup because I wanted to see if they errored out, which they did on several occasions - sometimes requiring deletion (backing up) of the XML file that loaded and other times just requiring a restart. I think you'll need to think through how you want to use the data and then attempt to connect to it once activity has died down. Additionally you can wait a significant period of time and run the command w/o nohup, again, to see if it is complete or not - the system is smart enough not to try to double load the files.

Best of luck. Perhaps one of the others can provide better instruction.

@Georg-coder
Copy link

Georg-coder commented Jan 31, 2020

@rabsef-bicrym I think your approach with Tableau is pretty nice and I would love to have that in the end as well. Is it working well with Tableau to look at the 990-data?

As you advised, I did load this time in screen and indeed there is a problem that keeps occurring:
...
run_filing 201510199349200201
Traceback (most recent call last):

File "/Users/Georg/PycharmProjects/990-xml-database/venv/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py", line 485, in wrap_socket
cnx.do_handshake()

File "/Users/Georg/PycharmProjects/990-xml-database/venv/lib/python3.6/site-packages/OpenSSL/SSL.py", line 1934, in do_handshake
self._raise_ssl_error(self._ssl, result)

File "/Users/Georg/PycharmProjects/990-xml-database/venv/lib/python3.6/site-packages/OpenSSL/SSL.py", line 1663, in _raise_ssl_error
raise SysCallError(errno, errorcode.get(errno))

OpenSSL.SSL.SysCallError: (54, 'ECONNRESET')
During handling of the above exception, another exception occurred:
.....

--> It might have to do with a few installments I made to resolve problems I had before. I also added to the load_filings.py:
import urllib3.contrib.pyopenssl
urllib3.contrib.pyopenssl.inject_into_urllib3()

When the files before run, does that mean the 990-data (2015) is downloaded to my PC?
I dont remember choosing/creating a location folder for it. I did connect a posgresql database (showing in the right upper corner of the screenshot) but dont really know how that works/what that is. I dont think I actually have server space there, but instead it is just a web-browser interface to do SQL queries...
(https://www.postgresqltutorial.com/load-postgresql-sample-database/)
Screenshot 2020-01-30 at 6 31 44 PM
Screenshot 2020-01-30 at 9 46 50 PM

@bcipolli
Copy link

@Georg-coder Can you take this question to another task, please? Will help others find the question (and answers) more easily.

@Georg-coder
Copy link

Georg-coder commented Feb 3, 2020

@rabsef-bicrym I managed to also view the data now in Tableua over the posgresql server. However, when loading the files, there is still an error terminating the download in between. In case you have stumbled over this or know the solution, I would be very grateful. I am working on this error since a few days: https://stackoverflow.com/questions/60046845/requests-exceptions-connectionerror-connection-aborted-remotedisconnected

@effeappbw
Copy link

Thanks for providing so much valuable insight @bcipolli. This metadata mystery is perplexing. I've been able to run the app, load filings, and even begin querying in sql & r, but I don't think i'm pulling metadata correctly.

I keep getting the following error:

ModuleNotFoundError: No module named 'metadata'

I've retraced every step I can think of:

My question: did your recall this issue during your deployment? If so, I'd appreciate any feedback--if only a general direction to investigate.

My only hunch at this point is that my BASE_DIR is pointing to a venv site packages path which isn't/can't be indexed correctly. (Please forgive me if I bungled protocol/semantics here--this is only my 2nd month programming).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants