Linux-aarch64 support? #136
Comments
There are issues with running Azure CIs with a package this size! We have some machines that can be used to manually trigger builds. Many of us are using our personal x86 64-bit machines and OSX machines to build this package anyway. I think you should add it to the migrator to ensure that all the dependencies are migrated. I suspect you will need to go through first. |
fyi: one could emulate aarch64 on M1 Macs relatively smoothly, e.g. using multipass.run (I believe I built tf2.7 once for aarch64 back in November) |
I have colleagues with the Apple M1 chipset, and we do lots of Dockerized development. Aarch64 is the relevant platform for Docker on M1. (In contrast, emulation of linux-64 runs roughly 6× more slowly.) I realize that there are lots of competing priorities, but it would be great to support this eventually. |
PRs always welcome. |
@hmaarrfk so it's not possible to do the build on Azure? What would be roughly needed for the PR? I can take a stab at it. We plan on using aarch64 (AWS Graviton) heavily so TF builds for aarch64 would be super useful. |
Not Mark, but my guess is one would want to do something like these 2 PRs (this could be one PR): |
We have been building TF locally. So ultimately, we don't need it to pass on azure, just to get "far enough" to convince us the recipe is working (very loose requirement), then typically somebody follows: |
May I ask whether the core members have machines for aarch64 and ppc64le, or have to build them on linux-64? |
I have been building qt main using emulation. It takes time, but maybe we can restrict the number of builds for aarch. |
FWIW, I believe the JAX developers implemented a few recent changes in upstream tensorflow that should make this (aarch64, strictly speaking) relatively straightforward/doable, though I personally didn't check and I likely won't be able to help much to initiate an effort. As I said above, if you have an M1/M2 machine, you could also reasonably test and build this there (in Docker, Multipass, etc.). If you start and get stuck, please tag me and I will try to help. As a rough guess, I think simply following what the bot would do (e.g. look at another PR and mimic it) should likely go a long way. |
Ok-- I'm going to take a stab at this when I have a moment (might be 2-3 weeks). Looks like OSX ARM64 is already covered so it's mainly linux aarch64 that we need. The actual compilation/usage of tensorflow on aarch64 works fine (we've already tried it). Main thing is ensuring the build uses similar build settings to what the other architectures use. I have access to actual aarch64 linux instances which should be better for initial testing since the TF build is hefty. On my M1 mac it's not all that fast. |
Current blocker if someone wants to help: conda-forge/tensorboard-data-server-feedstock#14 |
Hi @iamthebot, are you happy to give this a go? We now have tensorboard for linux-aarch64: conda-forge/tensorboard-data-server-feedstock#18 As you said, I think mainly what it takes is a powerful aarch64 machine. Unfortunately all I've got is a Raspberry Pi 4. I think the conda-forge aarch64 instance still does not work (@hmaarrfk @isuruf)? |
Another common example is an Apple M1 or M2 running Docker. (I don't have one myself.) |
I'm not sure, but I'm guessing you could just run the Linux script (of course replacing Regarding having built only 2.12, if it's not too much work to rebuild, you should be able to request that a |
According to the release notes, tensorboard 2.12 is the first one to support |
Dumb question-- do we really need to rebuild from source or can we just leverage the compiled libs from the officially released libs? Per @Tobias-Fischer that's the implication right? While I can certainly do a few one-off aarch64 builds not sure I can commit to a dedicated box for this / the overhead of running these builds on a cadence. |
Unfortunately, in conda-forge we cannot leverage pre-compiled libs. I found that an M1-powered MacBook works well for linux-aarch64 builds using the build-locally.py script, which uses Docker. |
If that's the case happy to help with some builds (have access to an M1 mac as well) assuming I'm not a single point of failure. Like I said, I can also easily run builds on an aarch64 linux instance. What would be needed to get started? |
Did emulation really not work? Typically, I can spare my workstation for 1 day. A 6-hour native build is roughly 30 hours of emulation. So if you have a few more than 2 cores, you can get that down to 5 hours. I guess tensorflow already takes 3 hours natively, so maybe 15 hours for aarch64 with emulation? |
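The back-of-the-envelope estimate in the comment above can be sketched as a small calculation. This is only an illustration: the function name `emulated_hours` and the default 5× slowdown are assumptions inferred from the "6 hours = 30 hours" and "3 hours ... 15 hours" figures quoted in this thread, not measured values.

```python
def emulated_hours(native_hours: float, slowdown: float = 5.0) -> float:
    """Estimate wall-clock hours for an emulated build.

    `slowdown` is an assumed multiplier (5x here, matching the
    6 h -> 30 h figure quoted in the thread); real numbers vary
    by machine and workload.
    """
    if native_hours < 0 or slowdown <= 0:
        raise ValueError("native_hours must be >= 0 and slowdown must be > 0")
    return native_hours * slowdown


if __name__ == "__main__":
    # Figures quoted in the thread: 6 h native and 3 h native.
    for native in (6, 3):
        print(f"{native} h native -> ~{emulated_hours(native):.0f} h emulated")
```

Passing a different `slowdown` (for example the roughly 6× figure another commenter reported for emulating linux-64 on M1) shows how sensitive the estimate is to that one guess.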
Our progress is here: #301 Basically, we are stuck with some build system changes that seem to have occurred. I might give it a go once they release another RC. |
That was my plan too. Btw, my general advice is: Wait until we clear the six hours on the public CI first before doing much locally unless you know precisely what you're doing. Currently, I really have no idea what's going on with the error we hit. One time in the past, I simply identified the commit that seemed to break things and kept nudging the person who committed it upstream (for Tensorflow, many important commits "just happen" directly on master/main without PRs because they're often pushed by internal engineers). After a week or so of tagging that person, they added a commit that essentially fixed the issue without notification or update (I noticed the commit because it tagged the comments in the issue). Once there's a working version with high confidence, we rarely have a problem finding volunteers to build locally. I really wouldn't worry about that. However, we cannot ask volunteers to help if we aren't relatively confident we have something working. For now, we simply don't have a passing build and it's erroring relatively early in the build process. That's the focus. When adding a new arch, like aarch64, I think we will wait to assess that once we have the other builds working (especially linux-64).
I think the aarch64 build on M1 Macs should be much faster btw, perhaps close to the regular osx build |
Issue:
Tensorflow builds are not currently run for aarch64. I'm raising this issue here before just opening a PR to add this to the migrator, since I suspect there may be issues with running linux-aarch64 CI with a package of this size. Note that GPU support is not necessary for my use case; I'd just like to get to the point on this architecture where enough is in place to open and work with pretrained model files.