Is this still valid? #24
Comments
There is no official information on which CPUs are affected (or not affected). You can still check the build logs in /mnt/ramdisk/workdir for problems other than a faulty CPU. |
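For anyone who wants to go through those logs quickly, here is a minimal Python sketch that scans a directory of build logs for typical failure lines. The `*.log` file pattern and the list of error strings are assumptions on my part, not something the script documents, so adjust them to whatever actually ends up in /mnt/ramdisk/workdir.

```python
from pathlib import Path

# Assumed markers of a failed build; this is not an official list.
ERROR_MARKERS = ("Segmentation fault", "internal compiler error", "Bus error")


def find_build_errors(workdir, pattern="*.log"):
    """Return (file name, line number, line) for every suspicious log line."""
    hits = []
    for log in sorted(Path(workdir).rglob(pattern)):
        # errors="replace" keeps the scan going if a log contains odd bytes.
        with open(log, errors="replace") as fh:
            for lineno, line in enumerate(fh, start=1):
                if any(marker in line for marker in ERROR_MARKERS):
                    hits.append((log.name, lineno, line.strip()))
    return hits
```

Calling `find_build_errors("/mnt/ramdisk/workdir")` after a failed run should list the lines worth reading first. Whether a hit points at the CPU or at something else still has to be judged by hand.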
Welp I think this script confirmed it. Another one for RMA. After much toying around (aka wasting valuable time) I tried the suggestion from #23 about disabling OpCode cache. As soon as I did that the kill ryzen script ran for about 20 minutes without crashing - by far the longest it has gone yet. I had to stop the script as I needed to get back on the machine but this proved that my CPU is affected as well. YD1700BBM88AE Very disappointing that AMD still hasn't gotten this under control. Now Newegg is giving me hassle about replacing it too. Ugh! Anyway, thanks for the script and the response to this issue! |
@suaefar I just installed a new replacement I purchased and hit this AGAIN. The new CPU is a week 33: 1733PGS. Testing was the same - a fresh 17.04 flash drive.
Checking build-8 log I see: and loop-12: Do you think I'm really that unlucky to get two post week 25 chips with the bug? I've already replaced the motherboard as well. Thanks! |
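Since the production week keeps coming up, here is a small Python sketch for reading a date code such as 1733PGS. It assumes the first four digits are a two-digit year followed by a two-digit week (which is how the codes are read in this thread) and that an optional `UA` prefix may be present. The function names are made up for illustration.

```python
import re


def parse_date_code(code):
    """Split a code like '1733PGS' or 'UA1733PGS' into (year, week, suffix)."""
    match = re.fullmatch(r"(?:UA\s*)?(\d{2})(\d{2})([A-Z]{3})", code.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized date code: {code!r}")
    year, week, suffix = match.groups()
    return 2000 + int(year), int(week), suffix


def is_after_week(code, year=2017, week=25):
    """True if the code's year/week is later than the given year/week."""
    code_year, code_week, _ = parse_date_code(code)
    return (code_year, code_week) > (year, week)
```

As the reports in this thread show, a late week number is not a guarantee of a good chip, so this only helps with sorting codes, not with deciding whether a CPU is affected.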
You should try to run the memory at stock settings, just to be sure. |
I’ve tried JEDEC SPD, XMP, and everything in between with the same results. Also tried bumping DRAM and SOC above the XMP values of 1.35v and 1.1v respectively to no avail. I just tried something new - popped out one stick of RAM so I’m down to 8 GB. The test ran for 10 minutes before running out of memory. That’s 8 minutes longer than ever before so maybe it’s RAM after all or maybe the bug affects the memory controller? Any recommendations on params to run with 8 GB? 2 loops and 2 threads? |
I'm in the same boat as you. My first R7 1700 was 1734 and had the bug show up. Went through the RMA process with NewEgg and just received... 1734PGS ... same week, showing segfaults so far... let's see if it segfaults at 1.35 volts... |
Then you both probably got faulty CPUs again. I cannot understand how AMD gets away with this. I am deeply disappointed. With 8 GB RAM better go for 3 loops 5 threads, or 2 loops 8 threads. Good luck! |
Thanks! Yet a third CPU will be here tomorrow. I’m thinking it might be time to look at getting a class action suit together to get them to talk. I used to love AMD but this is beyond ridiculous! |
Where do you live that NewEgg RMAs so fast? Or did you buy from another source? What mobo and RAM do you have? I'm even trying older BIOS versions to see if that might help... hasn't so far. I use this PC for work and have lost hours to this, I need a place that will do an advanced replacement... or suck it up and get a Zen+ processor when they come out. Hopefully it's not in that microarchitecture as well... |
Newegg refused to exchange it because it’s past 30 days. I was fighting some other issues and decided to RMA the motherboard first. By the time I switched out the motherboard, Newegg had closed my RMA for the CPU. They also pissed me off because they wouldn’t take the motherboard back because I had sent the UPC in for the rebate. This time I ordered from Amazon. I also use my PC for work as I work from home and couldn’t afford downtime. Newegg won’t do advanced replacement or returns on CPUs btw. I ended up filing a claim with my credit card for the return protection because of this mess. I finally got back a response from AMD days later and they approved an RMA no questions asked and gave me a 2 day label. Amazon was a one-click exchange and they do advanced replacement so hopefully the one coming tomorrow is not bugged. If it is, I’ll go through the AMD RMA. One way or another I’m getting to the bottom of this. I compile code on Linux for work and need this stable! I’ll send that back an |
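To see where combinations like these come from, here is a rough Python sketch. It assumes that the total number of compile jobs is loops × threads, that you want that total close to the number of hardware threads (16 on an R7 1700), and that fewer loops need less memory. None of this is taken from the script itself; the function name is hypothetical.

```python
def loop_thread_options(hw_threads, max_loops):
    """List (loops, threads) pairs with loops * threads <= hw_threads.

    For each loop count, pick the largest thread count that still fits.
    """
    options = []
    for loops in range(1, max_loops + 1):
        threads = hw_threads // loops
        if threads >= 1:
            options.append((loops, threads))
    return options
```

For 16 hardware threads and at most 3 loops this returns `[(1, 16), (2, 8), (3, 5)]`, which includes both settings suggested above. How many loops actually fit into 8 GB still has to be found by trial.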
Motherboard is an ASRock X370 Taichi. I’ve tried different bios versions without luck. For memory I’ve got 2x8gb GSkill FlareX. |
I have the same RAM as you but have the Aorus Gaming K7. I think I'm going to wait until Zen+ comes out to RMA it, then sell it because I seem to have bad luck. At least this one is 100% stable with my RAM's XMP profile according to stressapptest and aida64 |
Okay so I received my third CPU which is a 1744SUS this time and this script failed in about two minutes again with the segfault error. Given that this is the third post-week 25 chip I've had fail I'm pretty convinced at this point that something else is going on. Either something with the way this script runs on my machine (some weird thing when using zram?), memory settings, etc. I did experience some random lockups in both Windows and Linux without any MCE or BSODs with my first chip so I definitely think that one had something wrong. At this point I'm going to try running some real-world workloads and see if I can reproduce it. If so I'll dig deeper into the motherboard & RAM. One other interesting note is that AMD told me "please update your motherboard BIOS to the latest version with AGESA 1.0.0.6b after installing the CPU" which I have, but it does tell me that something about the BIOS could be impacting this. I've been on 3.30 but I'll try a few other versions. |
A short update on the segfault saga: I've determined that disabling ASLR does indeed work around the segfault issue or at least make it so I can't reproduce it with this script. I'm not sure I want to leave it disabled though as it is a small security risk running without it. I also received my RMA replacement from AMD today and it's a 1733SUS. Funny thing is it seems to be very common to get this batch number when you RMA it for this issue so perhaps it's a "known good" batch. I'll get it installed this week and run some tests. |
Finally an end to all this. I was able to complete 12 hours of the ryzen test without any issues using the 1733SUS that AMD sent as a replacement. It's very peculiar that 3/3 of the retail purchases were bugged but AMD sent me a non-bugged item. I also notice MANY people are getting the 1733SUS back as a replacement. It makes me wonder if this is some sort of golden batch that is known to work and AMD kept them to use for replacements. Meanwhile the other CPUs on the shelf are most likely bugged regardless of the week number, at least in my painful experience. So the answer is yes, this test is still valid. Thanks for putting this together and shame on AMD for selling known bugged CPUs! |
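For reference, the current ASLR setting on Linux can be read from `/proc/sys/kernel/randomize_va_space`. The following is a minimal Python sketch for checking it before and after a test run. The helper name and the wording of the mode descriptions are mine.

```python
ASLR_PATH = "/proc/sys/kernel/randomize_va_space"

# Meaning of the values as documented for the kernel sysctl.
ASLR_MODES = {
    0: "disabled",
    1: "partial (stack, mmap base, vDSO)",
    2: "full (also randomizes the heap/brk)",
}


def read_aslr_mode(path=ASLR_PATH):
    """Return (numeric value, description) of the current ASLR setting."""
    with open(path) as fh:
        value = int(fh.read().strip())
    return value, ASLR_MODES.get(value, "unknown")
```

Switching it off for a test is usually done as root with `sysctl -w kernel.randomize_va_space=0`, which does not persist across a reboot. As noted above, running without ASLR reduces security, so treat it as a diagnostic step rather than a fix.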
Thank you for sharing your story. |
I will re-open this issue until it finally is no issue anymore... |
I'm witnessing something interesting. When I run kill-ryzen.sh with no parameters it runs a lot longer before failing than kill-ryzen.sh 4 4, which fails in under two minutes. Is it just how the processor is being stressed that causes the difference in rate of failure? I'm waiting until the 2700X has been out for a while before buying it as my replacement, so that others can test it and make sure the segfault bug doesn't exist in the refresh
@disturbednny : what is the exact error you are hitting? Running 4 X 4 means you have four loops with 4 threads; without any parameter, you are running as many loops as there are threads on your CPU. Each loop will take longer to compile GCC. However, the stress will be similar or a bit higher with the latter. If it takes more time to fail with no parameters, that could indicate a problem with the compilation itself, not with the CPU. |
I'll have to run them again to get the segfault errors, but they are kernel segfault checks that show up when I type dmesg, and follow the error format in the log entries jstarcher posted with the line starting with make[5] |
Here's the dmesg output: loop-2 log loop-0 log |
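For comparing dmesg output between runs, here is a minimal Python sketch that pulls the kernel's segfault lines out of a dmesg dump and counts them per program. It assumes the usual x86 Linux message shape (`name[pid]: segfault at ... ip ... sp ... error ...`). The function names are hypothetical, and lines in other formats are simply ignored.

```python
import re

# Matches the common kernel message: "<comm>[<pid>]: segfault at <addr> ip <ip> sp <sp> error <code>"
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<error>[0-9a-f]+)"
)


def parse_segfaults(dmesg_text):
    """Return one dict (comm, pid, addr, ip, sp, error) per segfault line."""
    return [m.groupdict() for m in SEGFAULT_RE.finditer(dmesg_text)]


def count_by_command(dmesg_text):
    """Count segfaults per program name, e.g. to see which tools crash most."""
    counts = {}
    for entry in parse_segfaults(dmesg_text):
        counts[entry["comm"]] = counts.get(entry["comm"], 0) + 1
    return counts
```

The text can come from `subprocess.run(["dmesg"], capture_output=True, text=True).stdout` (on some systems dmesg needs root). Segfaults spread over many unrelated programs at varying addresses are what people in this thread have been reporting, but the counts alone do not prove a faulty CPU.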
This looks like you got a faulty Ryzen :( |
Probably many. |
Some good news, I received my R7 2700X this past Saturday, and successfully ran the kill-ryzen script for 8 hours straight with no segfault. So it looks like it is not present in the R7 and R5 2000 series. RMA'd my 1700 after installing the 2700X so we'll see what they give me. |
That's good news! I was really hoping that they would get it under control eventually. |
Thank you very much for providing this test! |
Hi. So to be clear, is disabling ASLR the answer to some of these issues? I've had a Ryzen 1700 1733PGS since Oct 2017 and the thing has been nothing but trouble. Dealing with this https://bugzilla.kernel.org/show_bug.cgi?id=196683 in addition to the general protection faults. I am running latest AGESA. My memory has been tested ok. Basically I can reproduce a general protection fault very easily by just running something that uses several threads. For instance using Saltstack config management commands I could repro a fault just about every time I ran a somewhat intensive job with ASLR on. With ASLR off I get no protection faults. Thanks! |
@infoveinx : short answer is we don't know. As long as AMD won't recognize and disclose the problem, we can't tell for sure. |
I've been building computers since the 1990s, and many of those have been specific for running Linux as either a desktop or server for personal use. In all of those years I've never encountered the kind of problems I've seen with this CPU. I've used both AMD and Intel too. I feel like 2017/2018 have been the worst given these issues, not to mention things like Spectre/Meltdown muddying the waters even more. AMD needs to open up about this issue, because it's quite obvious that there are real problems with this generation of processors. I know I'm preaching to the choir here. If you look at the link I posted above way down you'll see where folks have gone to several of the top techie news sites and reported some of the issues with these processors. They either get no response, or a response that states they aren't having the issues reported. I can't imagine this is the case with so many people reporting the same problems. It feels like one giant coverup if you ask me that even involves news and tech sites that test and write reviews on hardware. |
Basically AMD needs to release more info, because I don't think we fully understand this issue, hopefully 2700s are fixed as you say. My UA1733PGS has been running with no issues since my other thread and upped voltages. |
Increased voltages mean increased power consumption, more heat, higher temperatures and possibly lower performance... it is a workaround but no fix. |
Yes there is obviously an inherent issue, wouldn't be surprised if it's a design flaw in the end. |
suaefar, thanks for the effort from your side to help isolate and reproduce the problem. I've got a 1700 and three 1800X CPUs. I bought my 1700 in March 2017, I expected problems with a new microarchitecture as we have seen in the past. I was surprised when I found my 1800X CPUs are from the first week of production even though I bought them individually in December 2017 to January 2018. I tried to RMA one (1707) in April 2018. I can't afford to stop using all of them and I wasn't sure if replacements would work. AMD accepted my request (thanks to your "ryzen-test"), however when it came to shipping address I found out that AMD does not support my country. It looks like I am in it for the long run. Until a month ago I did not have so many problems, but since I started using newer kernels >= 4.15 I've seen multiple crashes per day, resulting in filesystem corruption beyond the point that fsck will repair. I've tried various distributions/kernels, CPU pinning, hugepages to try and isolate workloads in virtual machines. Thus far my best was 188 days uptime on my 1700 using pve-manager/5.0-23/af4267bf (running kernel: 4.10.15-1-pve) it's based on debian 9.4. Strange observation that my machine with the most memory did the best, it has 4x16GB RAM. I know this project aims to determine if your hardware is faulty or not. I'm begging for any advice on a workaround that would make my system(s) stable without spending a huge sum of money i.e. buying new CPUs or Windows license for each machine. Can we pressure AMD to assist kernel devs or provide more information on what they changed between chip revisions? |
Hi skarr, disabling ASLR, µOP-caching and SMT was on some occasions reported to increase stability. Unfortunately, we know close to nothing because AMD never shared any bit of information on this issue with us. |
Thanks for the advice, I really appreciate it. I found a product errata document by AMD in this post. This one stood out to me: 1109 MWAIT Instruction May Hang a Thread
I also found new responses in Kernel.org Bugzilla stating I will create a new issue/update this one if I find anything useful. |
@skarr have a look at this project: https://github.com/qrwteyrutiyoup/ryzen-stabilizator Disabling C6, ASLR, and enabling the power supply idle workaround helped me. Without these even my replacement “not bugged” CPU had random reboots and soft lockups on Ubuntu. I created a systemd startup unit to make these changes automatically. TL;DR there are other problems with first gen Ryzen in Linux outside of this compilation bug :( |
Finally did the RMA. They sent me a UA1733SUS, the one I sent back was a UA1733PGS. The new one is also not behaving. I reset bios and I'm also running latest bios. The only change I made under advanced CPU section was to enable the typical current idle, and SVM. I run about 5 qemu-kvm VMs. Basically the ones that actually do stuff are crashing with kernel panics randomly (just like before). If you let one sit long enough in panic state without killing it the host system will eventually have some kind of kernel issue and lock up. Generally the system is close to idle though as the VMs don't have much activity. To be clear I'm now on a 4.17 kernel from Debian Stretch backports. I've used 4.12, 4.13, 4.14, 4.15, and 4.16 prior. All of them unstable (although once upon a time 4.13 had a long uptime) but that was after disabling C-states both in bios and in software, and disabling ASLR. Subsequent kernel versions didn't seem to make a difference even with all of that disabled. Even with 4.13 every so often I would see a VM go to 100% and have to be restarted, though much less frequently. I also had a much older BIOS at that time. During the RMA process I moved all of the VM images back to an old 2012 Intel i3 that I had used prior. Not a single problem from that system in the week that I ran it and at times under heavy load. I was going to try a bunch of stuff like CPU pinning with VMs etc, but I've read other folks tried that and it still crashed. I'm not going to continue trying to make this work. I'm just not going to buy AMD ever again. In fact if I must I will purchase older gen processors after doing research to ensure they can handle running VMs under Linux. This has been a fight since November 2017 and I've lost countless hours. If anyone has suggestions I'm open to them, but at this point it seems like a lost fight and time to move on. |
Update. The initial crashes that I encountered with the replacement CPU were still under 4.16 kernel. The only incident I had with 4.17 kernel was starting the 5 linux VMs simultaneously and then one of them crashed not long after startup. Something else I did was disable IOMMU in bios almost two days ago and I can't quite recall if I did this prior to the single 4.17 kernel incident. I let it sit mostly idle for a little over one day and didn't experience a crash or idle lockup. Today I tried to use every method prior to crash it and it never experienced a single hiccup. I'm not sure what to make of it yet so going to let it go longer and see what happens. Unfortunately in the past I've seen it crash anywhere from within minutes, to multiple days. |
Latest update. System went 7 days without issue but is now back to being completely unstable. During those 7 days I ran it through a gamut of things from normal tasks, to many simultaneous things involving network I/O, disk transfers over both SATA and USB 3.0, along with some stress-ng runs. I had 6 Linux VMs running on it and it never hinted at a single issue. I found it on the 7th day locked up with a kernel panic. Since then it is completely unstable, with or without VMs running. It's very hard for me to understand how it can run so perfectly and then suddenly become so unstable with no changes. Some thoughts on this. I would assume if the mobo were bad that the behavior would have shown well before 7 days. The same for PSU. I did run RAM through an 8 hour memtest some time back and saw no problems. The only conclusion I can come to here is that the Linux kernel itself is just not working well with this CPU for whatever reason. The other thing I wondered is if perhaps something about the mobo is providing incorrect voltages and over time is degrading the CPU in some way. I can't really deal with this any longer so I think for now the system will just get shelved and replaced with a previous gen Intel. I've been reading around again and I see several folks who are essentially dealing with the same kind of conditions. I.e., stable then completely unstable with general protection faults/segfaults and basic system lockups etc. Yes it does seem like many people who RMA are getting 2017 Week 33 replacements. |
Have you tried everything outlined here: http://blog.programster.org/stabilizing-ubuntu-16-04-on-ryzen These changes seemed to really help me. At this point the only thing I get is a soft lockup on occasion which I’m 99% sure it’s an Nvidia driver issue. I can ssh to the machine and Xorg is locked up and I have errors in the Xorg log but I haven’t seen evidence of CPU instability. Also try with completely stock ram settings, not XMP profile. XMP isn’t guaranteed to be stable. Worth upping soc voltage to 1.1v if you haven’t yet as well. |
Yeah, thanks. I've tried all of those things and then some. The longest run I had was on kernel 4.13 (I'm running Debian 9.x so I'm using backports to get newer kernels). Basically it came down to disabling C-states in bios, disabling the remaining states via the Zenstates script, disabling ASLR, and finally blacklisting the nouveau driver. I went over 100 days uptime with that, however I still had the occasional VM lockup when I would do a heavy file transfer over network on the host system itself (not a VM on said system). Since then I've updated to latest bios and iterated over kernels 4.14, 4.15, 4.16, and now on 4.17. I have also tried not doing XMP with no change in results. Running XMP is showing the RAM rated timings in bios fwiw. That 100+ days of uptime was also on the bios that came with mobo which was quite old and prior to the addition of the power supply configurable idle states that AMD added. I haven't had a soft lockup in a very long time, my issues all appear to be related to memory now. I also tried to disable SMT, Opcache, etc. In the end the only way to keep it working was to pass maxcpus=1 to the kernel so that only a single CPU was used. In that case I was able to copy files across network (Samba) and back and forth to a USB 3.0 drive (as backup) with no crashes. I'm not running Xorg or any GUI on this system, but I do have a Geforce 210 as the video card. I've yet to see kernel errors related to video. I tried adding voltage slowly and that seemed to increase problems (on the returned CPU), but I am willing to try it again with new CPU. For clarity here are details of my system. Gigabyte AB350-Gaming 3 Bios F23d |
Also noticing these in boot log. Don't recall seeing them prior. I know there has been talk of fixes related to this in newer kernels. No idea how it relates but as stated prior I'm already on 4.17 kernel. [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0) |
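When testing boot parameters such as maxcpus=1, it helps to confirm that the running kernel was actually started with them. This is a minimal Python sketch that reads `/proc/cmdline` into a dict. The helper names are hypothetical, and values containing quoted spaces are not handled.

```python
def parse_cmdline(text):
    """Turn a kernel command line into a dict; bare flags map to None.

    Simplification: tokens are split on whitespace, so quoted values
    that contain spaces are not supported.
    """
    params = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def read_cmdline(path="/proc/cmdline"):
    """Parse the command line of the currently running kernel."""
    with open(path) as fh:
        return parse_cmdline(fh.read())
```

For example, `read_cmdline().get("maxcpus")` would be `"1"` on a system booted as described above, and `None` if the parameter was not passed.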
Did the following.
System is still randomly unstable. Had a hard locked CPU related to KVM, random segfaults may or may not happen after each reboot while trying to run a command that I know generates them sometimes. Still boggles my mind that it went 7 days with no issue running all kinds of tasks. No vms running, got these when I went to copy the qcow2 images to the Raid 1 to start backup and rebuild process with another system. page:fffff89a1dd21880 count:0 mapcount:0 mapping:0000000000000f00 index:0x1 I rebooted and set SOC back to Auto and then copied the 67G worth of qcow2 to the Raid1 no problem. Maybe there is just some kind of voltage regulator problem here I'm not sure. I'll mess with it on the side while I have the hopefully stable replacement up. I put a load of 21 on it last night via converting some h.265 to h.264 video with ffmpeg, all while running other things in a loop to try to break it, and of course it had zero problems. Running VMs though is a matter of time (and much shorter time lately). |
This sounds familiar... |
Unfortunately this is with the RMA CPU. I sent in a 1733PGS and got a 1733SUS back. |
I agree, might be time to RMA the motherboard and/or RAM. Maybe try one stick of ram at a time to try to isolate if you have a bad stick. |
Hello, just a quick question. |
Theoretically, yes. |
Same experience here. Memtest ran overnight without finding any errors. This bug requires heavy CPU usage across all threads to trigger, which memtest doesn’t do. That isn’t to say it is impossible for it to cause it to fail though. |
Well, I am sending my 1700X for RMA. Meanwhile I borrowed a 2600 and will check if the problem persists. Also, what is the expected final step of this script? |
If you have this problem today, you can disable the uop cache in AMD CBS settings in your BIOS (UEFI settings). uop as in micro-op, as in mu-op (μop). It hardly reduces performance if you disable it, but it completely fixes this issue. The code has to be tons of huge instructions to even be able to measure a difference in performance. The instruction decoder is so good, you hardly even need the uop cache. I didn't see "uop" mentioned in this thread, hopefully this isn't redundant. |
I'm wondering if anyone can confirm if this is still a valid test? I downloaded 17.04 and ran the tests as described and can't make it more than 200 seconds running all stock settings. I'm on a week 43 Ryzen 1700 and I can't seem to make anything else fail. 8hrs of Prime95, 8hrs of Memtest86, etc.
I've played with the DRAM voltage as well as SoC voltage and it didn't have any effect. One thing I noticed was that my integrated wifi adapter would throw a message in syslog and as soon as that happened this kill ryzen script would fail. I disabled the wifi adapter in bios and that allowed it to run longer, which makes me wonder if this script fails on false positives?