Is this still valid? #24

Open
jstarcher opened this issue Feb 21, 2018 · 51 comments

Comments

@jstarcher

I'm wondering if anyone can confirm whether this is still a valid test. I downloaded 17.04 and ran the tests as described, and I can't make it more than 200 seconds with all stock settings. I'm on a week 43 Ryzen 1700 and I can't seem to make anything else fail: 8hrs of Prime95, 8hrs of Memtest86, etc.

I've played with the DRAM voltage as well as the SoC voltage and it didn't have any effect. One thing I noticed was that my integrated wifi adapter would throw a message in syslog, and as soon as that happened the kill-ryzen script would fail. I disabled the wifi adapter in the BIOS and that allowed it to run longer, which makes me wonder if this script fails on false positives?

@suaefar
Owner

suaefar commented Feb 21, 2018

There is no official information on which CPUs are affected (or not affected).
Your description here does fit the ryzen segfault bug.
Prime95 and Memtest86 are not (as) sensitive to the bug as this workload.
If you hit a segfault that rapidly (less than 5 minutes) and only one or a few processes fail (and some continue running), then your CPU is probably affected.
If all processes fail within a short period it may be due to another problem.

You can still check the build logs in /mnt/ramdisk/workdir for problems other than a faulty CPU.
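The rule of thumb above (a quick failure in one or a few loops while the rest keep running) can be checked roughly from the script's console output. This is only a sketch: it assumes the output contains lines of the form "[loop-N] <date> start 0" and "[loop-N] TIME TO FAIL: <n> s", and the function name `summarize_run` is made up for illustration.

```shell
#!/bin/sh
# Summarise kill-ryzen.sh console output read on stdin.
# Assumption: loops announce themselves with "... start <n>" and report
# failures as "TIME TO FAIL: <n> s". Adjust the patterns if yours differ.
summarize_run() {
  awk '
    / start [0-9]+$/ { started++ }
    /TIME TO FAIL: [0-9]+ s/ {
      failed++
      t = $(NF - 1) + 0              # seconds until this loop failed
      if (failed == 1 || t < first) first = t
    }
    END {
      printf "started=%d failed=%d", started, failed
      if (failed > 0) printf " first_fail_s=%d", first
      printf "\n"
    }'
}
```

For example, save the console output to a file of your choosing and run `summarize_run < run.log`. A few failed loops out of many, with the first failure inside five minutes, matches the pattern described above; every loop failing at once points elsewhere.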

@jstarcher
Author

Welp I think this script confirmed it. Another one for RMA. After much toying around (aka wasting valuable time) I tried the suggestion from #23 about disabling OpCode cache. As soon as I did that the kill ryzen script ran for about 20 minutes without crashing - by far the longest it has gone yet. I had to stop the script as I needed to get back on the machine but this proved that my CPU is affected as well.

YD1700BBM88AE
UA 1743SUT

Very disappointing that AMD still hasn't gotten this under control. Now Newegg is giving me hassle about replacing it too. Ugh!

Anyway, thanks for the script and the response to this issue!

@jstarcher
Author

@suaefar I just installed a new replacement I purchased and hit this AGAIN. The new CPU is a week 33: 1733PGS. Testing was the same - a fresh 17.04 flash drive.

sudo dmidecode -t memory | grep -i -E "(rank|speed|part)" | grep -v -i unknown
Speed: 3200 MHz
Part Number: F4-3200C14-8GFX
Rank: 1
Configured Clock Speed: 1600 MHz
Speed: 3200 MHz
Part Number: F4-3200C14-8GFX
Rank: 1
Configured Clock Speed: 1600 MHz
uname -a
Linux ubuntu 4.10.0-19-generic #21-Ubuntu SMP Thu Apr 6 17:04:57 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
cat /proc/sys/kernel/randomize_va_space
2
/ /mnt/ramdisk/workdir /mnt/ramdisk/workdir
Using 16 parallel processes
[KERN] -- Logs begin at Fri 2018-02-23 15:36:39 EST. --
[KERN] Feb 23 15:36:55 ubuntu systemd[1]: snapd.refresh.timer: Adding 3h 21min 14.449848s random time.
[KERN] Feb 23 15:36:55 ubuntu systemd[1]: apt-daily.timer: Adding 2h 14min 40.656684s random time.
[KERN] Feb 23 15:36:55 ubuntu systemd[1]: motd-news.timer: Adding 51min 54.931579s random time.
[KERN] Feb 23 15:37:05 ubuntu systemd[1]: snapd.refresh.timer: Adding 1h 54min 27.129499s random time.
[KERN] Feb 23 15:37:05 ubuntu systemd[1]: snapd.refresh.timer: Adding 44min 49.245281s random time.
[KERN] Feb 23 15:37:05 ubuntu systemd[1]: apt-daily.timer: Adding 4h 55min 48.521122s random time.
[KERN] Feb 23 15:37:05 ubuntu systemd[1]: motd-news.timer: Adding 18min 53.032133s random time.
[KERN] Feb 23 15:39:10 ubuntu kernel: zram: Added device: zram0
[KERN] Feb 23 15:39:10 ubuntu kernel: zram0: detected capacity change from 0 to 68719476736
[KERN] Feb 23 15:39:10 ubuntu kernel: EXT4-fs (zram0): mounted filesystem with ordered data mode. Opts: discard
[loop-0] Fri Feb 23 15:39:46 EST 2018 start 0
[loop-1] Fri Feb 23 15:39:47 EST 2018 start 0
[loop-2] Fri Feb 23 15:39:48 EST 2018 start 0
[loop-3] Fri Feb 23 15:39:49 EST 2018 start 0
[loop-4] Fri Feb 23 15:39:50 EST 2018 start 0
[loop-5] Fri Feb 23 15:39:51 EST 2018 start 0
[loop-6] Fri Feb 23 15:39:52 EST 2018 start 0
[loop-7] Fri Feb 23 15:39:53 EST 2018 start 0
[loop-8] Fri Feb 23 15:39:54 EST 2018 start 0
[loop-9] Fri Feb 23 15:39:55 EST 2018 start 0
[loop-10] Fri Feb 23 15:39:56 EST 2018 start 0
[loop-11] Fri Feb 23 15:39:57 EST 2018 start 0
[loop-12] Fri Feb 23 15:39:58 EST 2018 start 0
[loop-13] Fri Feb 23 15:39:59 EST 2018 start 0
[loop-14] Fri Feb 23 15:40:00 EST 2018 start 0
[loop-15] Fri Feb 23 15:40:01 EST 2018 start 0
[loop-12] Fri Feb 23 15:42:13 EST 2018 build failed
[loop-12] TIME TO FAIL: 147 s
[KERN] Feb 23 15:42:13 ubuntu kernel: traps: bash[32728] general protection ip:445b20 sp:7fff1ce38448 error:0
[KERN] Feb 23 15:42:13 ubuntu kernel: in bash[400000+100000]
[KERN] Feb 23 15:42:26 ubuntu kernel: IPv6: ADDRCONF(NETDEV_UP): wlp9s0: link is not ready
[loop-8] Fri Feb 23 15:43:32 EST 2018 build failed
[loop-8] TIME TO FAIL: 226 s
[KERN] Feb 23 15:43:32 ubuntu kernel: bash[21958]: segfault at d ip 0000000000431f2e sp 00007ffc28f648c0 error 4 in bash[400000+100000]
[KERN] Feb 23 15:47:41 ubuntu kernel: IPv6: ADDRCONF(NETDEV_UP): wlp9s0: link is not ready
[KERN] Feb 23 15:52:57 ubuntu kernel: IPv6: ADDRCONF(NETDEV_UP): wlp9s0: link is not ready
[KERN] Feb 23 15:58:12 ubuntu kernel: IPv6: ADDRCONF(NETDEV_UP): wlp9s0: link is not ready

Checking build-8 log I see:
/bin/bash ../libtool --tag=CC --mode=link gcc -DNO_ASM -g -version-info 5:4:1 -static-libstdc++ -static-libgcc -o libmpfr.la -rpath /usr/local/lib exceptions.lo extract.lo uceil_exp2.lo uceil_log2.lo ufloor_log2.lo add.lo add1.lo add_ui.lo agm.lo clear.lo cmp.lo cmp_abs.lo cmp_si.lo cmp_ui.lo comparisons.lo div_2exp.lo div_2si.lo div_2ui.lo div.lo div_ui.lo dump.lo eq.lo exp10.lo exp2.lo exp3.lo exp.lo frac.lo frexp.lo get_d.lo get_exp.lo get_str.lo init.lo inp_str.lo isinteger.lo isinf.lo isnan.lo isnum.lo const_log2.lo log.lo modf.lo mul_2exp.lo mul_2si.lo mul_2ui.lo mul.lo mul_ui.lo neg.lo next.lo out_str.lo printf.lo vasprintf.lo const_pi.lo pow.lo pow_si.lo pow_ui.lo print_raw.lo print_rnd_mode.lo reldiff.lo round_prec.lo set.lo setmax.lo setmin.lo set_d.lo set_dfl_prec.lo set_exp.lo set_rnd.lo set_f.lo set_prc_raw.lo set_prec.lo set_q.lo set_si.lo set_str.lo set_str_raw.lo set_ui.lo set_z.lo sqrt.lo sqrt_ui.lo sub.lo sub1.lo sub_ui.lo rint.lo ui_div.lo ui_sub.lo urandom.lo urandomb.lo get_z_exp.lo swap.lo factorial.lo cosh.lo sinh.lo tanh.lo sinh_cosh.lo acosh.lo asinh.lo atanh.lo atan.lo cmp2.lo exp_2.lo asin.lo const_euler.lo cos.lo sin.lo tan.lo fma.lo fms.lo hypot.lo log1p.lo expm1.lo log2.lo log10.lo ui_pow.lo ui_pow_ui.lo minmax.lo dim.lo signbit.lo copysign.lo setsign.lo gmp_op.lo init2.lo acos.lo sin_cos.lo set_nan.lo set_inf.lo set_zero.lo powerof2.lo gamma.lo set_ld.lo get_ld.lo cbrt.lo volatile.lo fits_sshort.lo fits_sint.lo fits_slong.lo fits_ushort.lo fits_uint.lo fits_ulong.lo fits_uintmax.lo fits_intmax.lo get_si.lo get_ui.lo zeta.lo cmp_d.lo erf.lo inits.lo inits2.lo clears.lo sgn.lo check.lo sub1sp.lo version.lo mpn_exp.lo mpfr-gmp.lo mp_clz_tab.lo sum.lo add1sp.lo free_cache.lo si_op.lo cmp_ld.lo set_ui_2exp.lo set_si_2exp.lo set_uj.lo set_sj.lo get_sj.lo get_uj.lo get_z.lo iszero.lo cache.lo sqr.lo int_ceil_log2.lo isqrt.lo strtofr.lo pow_z.lo logging.lo mulders.lo get_f.lo round_p.lo erfc.lo atan2.lo subnormal.lo const_catalan.lo root.lo sec.lo csc.lo cot.lo eint.lo sech.lo csch.lo coth.lo round_near_x.lo constant.lo abort_prec_max.lo stack_interface.lo lngamma.lo zeta_ui.lo set_d64.lo get_d64.lo jn.lo yn.lo rem1.lo get_patches.lo add_d.lo sub_d.lo d_sub.lo mul_d.lo div_d.lo d_div.lo li2.lo rec_sqrt.lo min_prec.lo buildopt.lo digamma.lo bernoulli.lo isregular.lo set_flt.lo get_flt.lo scale2.lo set_z_exp.lo ai.lo gammaonethird.lo grandom.lo -lgmp ßßßßßßßßßßß^K
Makefile:518: recipe for target 'libmpfr.la' failed
make[5]: *** [libmpfr.la] Segmentation fault (core dumped)
make[5]: Leaving directory '/mnt/ramdisk/workdir/buildloop.d/loop-8/mpfr/src'
Makefile:446: recipe for target 'all' failed
make[4]: *** [all] Error 2
make[4]: Leaving directory '/mnt/ramdisk/workdir/buildloop.d/loop-8/mpfr/src'
Makefile:468: recipe for target 'all-recursive' failed
make[3]: *** [all-recursive] Error 1
make[3]: Leaving directory '/mnt/ramdisk/workdir/buildloop.d/loop-8/mpfr'
Makefile:6475: recipe for target 'all-stage1-mpfr' failed
make[2]: *** [all-stage1-mpfr] Error 2
make[2]: Leaving directory '/mnt/ramdisk/workdir/buildloop.d/loop-8'
Makefile:27079: recipe for target 'stage1-bubble' failed
make[1]: *** [stage1-bubble] Error 2
make[1]: Leaving directory '/mnt/ramdisk/workdir/buildloop.d/loop-8'
Makefile:941: recipe for target 'all' failed
make: *** [all] Error 2

and loop-12:
Makefile:864: recipe for target 'libgmp.la' failed
make[5]: *** [libgmp.la] Segmentation fault (core dumped)
make[5]: Leaving directory '/mnt/ramdisk/workdir/buildloop.d/loop-12/gmp'
Makefile:954: recipe for target 'all-recursive' failed
make[4]: *** [all-recursive] Error 1
make[4]: Leaving directory '/mnt/ramdisk/workdir/buildloop.d/loop-12/gmp'
Makefile:773: recipe for target 'all' failed
make[3]: *** [all] Error 2
make[3]: Leaving directory '/mnt/ramdisk/workdir/buildloop.d/loop-12/gmp'
Makefile:5521: recipe for target 'all-stage1-gmp' failed
make[2]: *** [all-stage1-gmp] Error 2
make[2]: Leaving directory '/mnt/ramdisk/workdir/buildloop.d/loop-12'
Makefile:27079: recipe for target 'stage1-bubble' failed
make[1]: *** [stage1-bubble] Error 2
make[1]: Leaving directory '/mnt/ramdisk/workdir/buildloop.d/loop-12'
Makefile:941: recipe for target 'all' failed
make: *** [all] Error 2

Do you think I'm really that unlucky to get two post week 25 chips with the bug? I've already replaced the motherboard as well.

Thanks!
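For anyone comparing their own logs against the ones above: the interesting lines are the `make[N]: *** [target] Segmentation fault` ones, since they name the step that crashed. A small sketch for pulling out just those targets follows. It assumes GNU make's English message format as shown in the logs above, and the function name `segfaulted_targets` is made up.

```shell
#!/bin/sh
# Print the make targets whose recipe died with a segmentation fault.
# Reads build-log text on stdin; lines in any other format are ignored.
segfaulted_targets() {
  sed -n 's/^make\[[0-9]*]: \*\*\* \[\(.*\)] Segmentation fault.*$/\1/p'
}
```

Feed it whichever build log you are inspecting, e.g. `segfaulted_targets < some-build.log`. If different loops crash on different, unrelated targets (here `libmpfr.la` and `libgmp.la`), that looks more like random corruption than a genuine build error.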

@jstarcher jstarcher reopened this Feb 23, 2018
@m-r-s

m-r-s commented Feb 23, 2018

You should try to run the memory at stock settings, just to be sure.
Unstable memory also can result in segfaults.

@jstarcher
Author

I’ve tried JEDEC SPD, XMP, and everything in between with the same results. Also tried bumping DRAM and SoC above the XMP values of 1.35 V and 1.1 V respectively, to no avail.

I just tried something new: popped out one stick of RAM so I’m down to 8 GB. The test ran for 10 minutes before running out of memory. That’s 8 minutes longer than ever before, so maybe it’s the RAM after all, or maybe the bug affects the memory controller?

Any recommendations on params to run with 8 GB? 2 loops and 2 threads?

@disturbednny

I'm in the same boat as you. My first R7 1700 was 1734 and had the bug show up. I went through the RMA process with Newegg and just received... 1734PGS... same week, showing segfaults so far... let's see if it segfaults at 1.35 volts...

@m-r-s

m-r-s commented Feb 24, 2018

Then you both probably got faulty CPUs again.
We don't know what exactly is wrong, possibly something memory-related... the controller, or cache coherency.
We don't know how to distinguish good from bad ones (AMD did not tell us, maybe even they don't know).
The only tool we have is to run workloads on these CPUs which are likely to trigger the behavior.

I cannot understand how AMD gets away with this.
There must be thousands of faulty CPUs around, and they still sell them :(

I am deeply disappointed.

With 8 GB RAM, better go for 3 loops with 5 threads, or 2 loops with 8 threads.

Good luck!

@jstarcher
Author

Thanks! Yet a third CPU will be here tomorrow. I’m thinking it might be time to look at getting a class action suit together to get them to talk. I used to love AMD but this is beyond ridiculous!

@disturbednny

Where do you live that Newegg RMAs so fast? Or did you buy from another source? What mobo and RAM do you have? I'm even trying older BIOS versions to see if that might help... it hasn't so far. I use this PC for work and have lost hours to this. I need a place that will do an advanced replacement... or suck it up and get a Zen+ processor when they come out. Hopefully it's not in that microarchitecture as well...

@jstarcher
Author

Newegg refused to exchange it because it’s past 30 days. I was fighting some other issues and decided to RMA the motherboard first. By the time I had switched out the motherboard, Newegg had closed my RMA for the CPU. They also pissed me off because they wouldn’t take the motherboard back because I had sent the UPC in for the rebate.

This time I ordered from Amazon. I also use my PC for work as I work from home and couldn’t afford downtime. Newegg won’t do advanced replacement or returns on CPUs btw. I ended up filing a claim with my credit card for the return protection because of this mess.

I finally got back a response from AMD days later and they approved an RMA no questions asked and gave me a 2 day label.

Amazon was a one-click exchange and they do advanced replacement, so hopefully the one coming tomorrow is not bugged. If it is, I’ll go through the AMD RMA.

One way or another I’m getting to the bottom of this. I compile code on Linux for work and need this stable!

I’ll send that back an

@jstarcher
Author

Motherboard is an ASRock X370 Taichi. I’ve tried different BIOS versions without luck. For memory I’ve got 2x8 GB GSkill FlareX.

@disturbednny

I have the same RAM as you but have the Aorus Gaming K7. I think I'm going to wait until Zen+ comes out to RMA it, then sell it, because I seem to have bad luck. At least this one is 100% stable with my RAM's XMP profile according to stressapptest and AIDA64.

@jstarcher
Author

Okay, so I received my third CPU, which is a 1744SUS this time, and this script failed in about two minutes again with the segfault error. Given that this is the third post-week 25 chip I've had fail, I'm pretty convinced at this point that something else is going on: either something with the way this script runs on my machine (some weird thing when using zram?), memory settings, etc. I did experience some random lockups in both Windows and Linux without any MCE or BSODs with my first chip, so I definitely think that one had something wrong.

At this point I'm going to try running some real-world workloads and see if I can reproduce it. If so I'll dig deeper into the motherboard & RAM. One other interesting note is that AMD told me "please update your motherboard BIOS to the latest version with AGESA 1.0.0.6b after installing the CPU", which I have, but it does tell me that something about the BIOS could be impacting this. I've been on 3.30 but I'll try a few other versions.

@jstarcher
Author

A short update on the segfault saga: I've determined that disabling ASLR does indeed work around the segfault issue, or at least makes it so I can't reproduce it with this script. I'm not sure I want to leave it disabled though, as running without it is a small security risk.

I also received my RMA replacement from AMD today and it's a 1733SUS. Funny thing is it seems to be very common to get this batch number when you RMA it for this issue so perhaps it's a "known good" batch. I'll get it installed this week and run some tests.
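For reference, a sketch of how the ASLR setting mentioned above can be inspected and switched off for a test run. The sysctl name `kernel.randomize_va_space` and `setarch -R` are standard Linux facilities; the helper name `aslr_state` and its optional file argument are made up for illustration.

```shell
#!/bin/sh
# Describe the current ASLR setting. The optional argument points the
# function at another file, which only exists to make it easy to try out.
aslr_state() {
  value=$(cat "${1:-/proc/sys/kernel/randomize_va_space}") || return 1
  case "$value" in
    0) echo "disabled" ;;
    1) echo "partial" ;;   # stack, mmap base and VDSO randomised
    2) echo "full" ;;      # as 1, plus heap (brk) randomisation
    *) echo "unknown: $value" ;;
  esac
}

# Temporary and system-wide (lasts until reboot), needs root:
#   sudo sysctl -w kernel.randomize_va_space=0
# For a single command only, leaving the system default alone:
#   setarch "$(uname -m)" -R <command>
```

The per-command form is probably the safer way to experiment, since the rest of the system keeps ASLR enabled.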

@jstarcher
Author

Finally an end to all this. I was able to complete 12 hours of the ryzen test without any issues using the 1733SUS that AMD sent as a replacement.

It's very peculiar that 3/3 of the retail purchases were bugged but AMD sent me a non-bugged item. I also notice MANY people are getting the 1733SUS back as a replacement. It makes me wonder if this is some sort of golden batch that is known to work and that AMD kept to use for replacements. Meanwhile the other CPUs on the shelf are most likely bugged regardless of the week number, at least in my painful experience.

So the answer is yes, this test is still valid. Thanks for putting this together and shame on AMD for selling known bugged CPUs!

@m-r-s

m-r-s commented Mar 21, 2018

Thank you for sharing your story.
It is unbelievable that they get away with this...

@suaefar
Owner

suaefar commented Mar 22, 2018

I will re-open this issue until it finally is no issue anymore...

@suaefar suaefar reopened this Mar 22, 2018
@disturbednny

I'm witnessing something interesting.

When I run kill-ryzen.sh with no parameters, it runs a lot longer before failing than kill-ryzen.sh 4 4, which fails in under two minutes. Is it just how the processor is being stressed that causes the difference in rate of failure? I'm waiting until the 2700X has been out for a while before buying it as my replacement, so that others can test it and make sure the segfault bug doesn't exist in the refresh.

@Oxalin
Contributor

Oxalin commented Apr 15, 2018

@disturbednny : what is the exact error you are hitting? Running 4 X 4 means you have four loops with 4 threads each; without any parameter, you are running as many loops as there are threads on your CPU. Each loop will take longer to compile GCC. However, the stress will be similar or a bit higher with the latter. If it takes more time to fail with no parameters, that could indicate a problem with the compilation itself, not with the CPU.
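To compare parameter choices, it can help to look at the total number of compile jobs a given "loops threads" pair asks for. The sketch below is only an estimate helper, not part of kill-ryzen.sh. The defaults it uses (one loop per logical CPU, one thread per loop) are an assumption based on the description above, and the name `total_jobs` is made up.

```shell
#!/bin/sh
# Rough load estimate for "kill-ryzen.sh <loops> <threads>".
# Assumed defaults: one loop per logical CPU, one thread per loop.
total_jobs() {
  loops=${1:-$(nproc)}
  threads=${2:-1}
  echo $((loops * threads))
}
```

`total_jobs 4 4` gives 16 and `total_jobs 3 5` gives 15, so on a 16-thread Ryzen 7 both keep the CPU about as busy as the parameterless run while needing far less RAM, because fewer GCC build trees exist at once.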

@disturbednny

I'll have to run them again to get the segfault errors, but they are kernel segfault checks that show up when I type dmesg, and follow the error format in the log entries jstarcher posted with the line starting with make[5]

@disturbednny

Here's the dmesg output:
[KERN] Apr 15 22:40:00 ubuntu kernel: traps: bash[12803] general protection ip:435bc4 sp:7ffe2774fec0 error:0
[KERN] Apr 15 22:40:05 ubuntu kernel: bash[18352]: segfault at 6e61c4 ip 000000000043d790 sp 00007ffe54c53900 error 6 in bash[400000+100000]

loop-2 log
make[5]: *** [rint.lo] Segmentation fault (core dumped)

loop-0 log
Makefile:761: recipe for target 'set_ui.lo' failed
make[5]: *** [set_ui.lo] Segmentation fault (core dumped)
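Kernel lines like the two above are easy to miss among unrelated messages, so here is a minimal sketch for counting them. It assumes the x86 Linux message wording shown in this thread ("segfault at" and "general protection"); the function name `count_kernel_faults` is made up.

```shell
#!/bin/sh
# Count user-space fault reports in kernel log text read on stdin.
# "|| true" keeps the exit status at 0 when grep finds nothing.
count_kernel_faults() {
  grep -cE 'segfault at|general protection' || true
}
```

Typical use would be `dmesg | count_kernel_faults` or `journalctl -k | count_kernel_faults` after a test run. A nonzero count in processes such as bash during the build is what the reports in this thread have in common.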

@m-r-s

m-r-s commented Apr 17, 2018

This looks like you got a faulty Ryzen :(
I wonder how many are still out there producing erroneous results every day...

@suaefar
Owner

suaefar commented Apr 27, 2018

Probably many.
People still report faulty CPUs as of week 48 in 2017: UA 1748PGS (https://community.amd.com/message/2857007#comment-2857007)

@disturbednny

Some good news:

I received my R7 2700X this past Saturday and successfully ran the kill-ryzen script for 8 hours straight with no segfault. So it looks like it is not present in the R7 and R5 2000 series. I RMA'd my 1700 after installing the 2700X, so we'll see what they give me.

@m-r-s

m-r-s commented Apr 27, 2018

That's good news! I was really hoping that they would get it under control eventually.

@7Z0t99

7Z0t99 commented May 12, 2018

Thank you very much for providing this test!
A few days ago, I got a Ryzen 5 1600 (lot 1743SUS), which failed the test in under 3 minutes.
The dealer was so kind to take it back and let me order a Ryzen 5 2600 (lot 1806SUT) as a replacement, which seems to work just fine.

@infoveinx

Hi. So to be clear, is disabling ASLR the answer to some of these issues? I've had a Ryzen 1700 1733PGS since Oct 2017 and the thing has been nothing but trouble. Dealing with this https://bugzilla.kernel.org/show_bug.cgi?id=196683 in addition to the general protection faults.

I am running the latest AGESA. My memory has tested OK. Basically I can reproduce a general protection fault very easily by just running something that uses several threads. For instance, using Saltstack config management commands, I could repro a fault just about every time I ran a somewhat intensive job with ASLR on. With ASLR off I get no protection faults.

Thanks!

@Oxalin
Contributor

Oxalin commented Jun 18, 2018

@infoveinx : short answer is we don't know. As long as AMD won't recognize and disclose the problem, we can't tell for sure.

@infoveinx

I've been building computers since the 1990s, and many of those have been specifically for running Linux as either a desktop or server for personal use. In all of those years I've never encountered the kind of problems I've seen with this CPU. I've used both AMD and Intel too. I feel like 2017/2018 have been the worst given these issues, not to mention things like Spectre/Meltdown muddying the waters even more.

AMD needs to open up about this issue, because it's quite obvious that there are real problems with this generation of processors. I know I'm preaching to the choir here. If you look at the link I posted above way down you'll see where folks have gone to several of the top techie news sites and reported some of the issues with these processors. They either get no response, or a response that states they aren't having the issues reported. I can't imagine this is the case with so many people reporting the same problems. It feels like one giant coverup if you ask me that even involves news and tech sites that test and write reviews on hardware.

@protox

protox commented Jun 27, 2018

Basically AMD needs to release more info, because I don't think we fully understand this issue, hopefully 2700s are fixed as you say.

My UA1733PGS has been running with no issues since my other thread and upped voltages.

@m-r-s

m-r-s commented Jun 27, 2018

Increased voltages mean increased power consumption, more heat, higher temperatures and possibly lower performance... it is a workaround but no fix.
Nobody should need to touch the stock voltages to get a stable system.

@protox

protox commented Jun 27, 2018

Yes there is obviously an inherent issue, wouldn't be surprised if it's a design flaw in the end.

@skarr

skarr commented Jul 20, 2018

suaefar, thanks for the effort from your side to help isolate and reproduce the problem.

I've got a 1700 and three 1800X CPUs. I bought my 1700 in March 2017, I expected problems with new micro architecture as we have seen in the past. I was surprised when I found my 1800X CPUs are from the first week of production even though I bought them individually in December 2017 to January 2018.

I tried to RMA one (1707) in April 2018. I can't afford to stop using all of them and I wasn't sure if replacements would work. AMD accepted my request (thanks to your "ryzen-test"), however when it came to shipping address I found out that AMD does not support my country. It looks like I am in it for the long run.

Until a month ago I did not have so many problems, but since I started using newer kernels >= 4.15 I've seen multiple crashes per day, resulting in filesystem corruption beyond the point that fsck will repair. I've tried various distributions/kernels, CPU pinning, and hugepages to try and isolate workloads in virtual machines. Thus far my best was 188 days uptime on my 1700 using pve-manager/5.0-23/af4267bf (running kernel: 4.10.15-1-pve); it's based on Debian 9.4. A strange observation is that my machine with the most memory did the best; it has 4x16GB RAM.

I know this project aims to determine if your hardware is faulty or not. I'm begging for any advice on a workaround that would make my system(s) stable without spending a huge sum of money i.e. buying new CPUs or Windows license for each machine. Can we pressure AMD to assist kernel devs or provide more information on what they changed between chip revisions?

@suaefar
Owner

suaefar commented Jul 21, 2018

Hi skarr,

disabling ASLR, µOP-caching and SMT was on some occasions reported to increase stability.
Depending on your workload, it might also help to pin processes to certain CPUs (with "taskset"), but I am not sure.

Unfortunately, we know close to nothing because AMD never shared any bit of information on this issue with us.
You bought a product which does not work as expected.
I would simply return it.
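As an illustration of the "taskset" suggestion above, here is a small wrapper sketch. `taskset -c <cpu-list> <command>` is the standard util-linux invocation; the wrapper name `pin_run` and the `DRY_RUN` switch are made up for illustration, and which CPUs are worth pinning to is something you would have to find out by experiment.

```shell
#!/bin/sh
# Run a command pinned to a set of logical CPUs, e.g. "0-3" or "0,2,4,6".
# With DRY_RUN=1 the taskset command line is only printed, not executed.
pin_run() {
  cpus=$1
  shift
  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf 'taskset -c %s' "$cpus"
    printf ' %s' "$@"
    printf '\n'
  else
    taskset -c "$cpus" "$@"
  fi
}
```

For example, `pin_run 0-3 make -j4` keeps a build on the first four logical CPUs. Whether pinning actually improves stability on an affected chip is unconfirmed, as noted above.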

@skarr

skarr commented Jul 21, 2018

Thanks for the advice, I really appreciate it.

I found a product errata document by AMD in this post. This one stood out to me:

1109 MWAIT Instruction May Hang a Thread

Description
Under a highly specific and detailed set of internal timing conditions, the MWAIT instruction may cause a thread to hang in SMT (Simultaneous Multithreading) Mode.
Potential Effect on System
The system may hang or reset.
Suggested Workaround
System software may contain the workaround for this erratum.
Fix Planned
No fix planned

I also found new responses in the Kernel.org Bugzilla stating that idle=nomwait fixed all hangs. I am in the process of testing this for myself. My long-term strategy is to try these and other workarounds while I continue the fight with the local suppliers. It looks like reddit users are happy with the RMA process, which is not helping my case. Now that many people are obtaining newer CPUs, it looks like this issue is being swept under the carpet: "nothing to see here, please disperse".

I will create a new issue/update this one if I find anything useful.
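When testing a boot parameter such as idle=nomwait, it is worth confirming that the running kernel actually received it. A minimal sketch follows, assuming a Linux system where `/proc/cmdline` holds the boot command line; the helper name `has_kernel_param` and its optional file argument are made up.

```shell
#!/bin/sh
# Succeeds if the given parameter appears on the kernel command line.
# The optional second argument exists only so the check can be tried on a file.
has_kernel_param() {
  tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}

# To make the parameter persistent on a Debian/Ubuntu-style GRUB setup:
#   1. add idle=nomwait to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub
#   2. sudo update-grub
#   3. reboot, then: has_kernel_param idle=nomwait && echo "active"
```

The match is on whole words, so `has_kernel_param nomwait` does not report success just because `idle=nomwait` is present.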

@jstarcher
Author

@skarr have a look at this project: https://github.com/qrwteyrutiyoup/ryzen-stabilizator

Disabling C6 and ASLR, and enabling the power supply idle workaround, helped me. Without these even my replacement “not bugged” CPU had random reboots and soft lockups on Ubuntu. I created a systemd startup unit to make these changes automatically.

TL;DR there’s other problems with first gen Ryzen in Linux outside of this compilation bug :(
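A startup unit of the kind mentioned above could look roughly like the one this sketch prints. It is not the unit from that setup: the script path `/usr/local/sbin/ryzen-workarounds.sh`, the unit name and the function name `print_workaround_unit` are all placeholders, and the script that applies the actual workarounds is left to you.

```shell
#!/bin/sh
# Print a oneshot systemd unit that runs a workaround script at boot.
# The script path is a placeholder; pass your own as the first argument.
print_workaround_unit() {
  script=${1:-/usr/local/sbin/ryzen-workarounds.sh}
  cat <<EOF
[Unit]
Description=Apply first-gen Ryzen stability workarounds

[Service]
Type=oneshot
ExecStart=$script
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
EOF
}
```

To install it, something like `print_workaround_unit | sudo tee /etc/systemd/system/ryzen-workarounds.service`, followed by `sudo systemctl daemon-reload` and `sudo systemctl enable --now ryzen-workarounds.service`, should work on a systemd-based distribution.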

@infoveinx

infoveinx commented Aug 3, 2018

Finally did the RMA. They sent me a UA1733SUS; the one I sent back was a UA1733PGS. The new one is also not behaving. I reset the BIOS and I'm also running the latest BIOS. The only change I made under the advanced CPU section was to enable typical current idle and SVM. I run about 5 qemu-kvm VMs. Basically the ones that actually do stuff are crashing with kernel panics randomly (just like before). If you let one sit long enough in a panic state without killing it, the host system will eventually have some kind of kernel issue and lock up. Generally the system is close to idle though, as the VMs don't have much activity.

To be clear I'm now on a 4.17 kernel from Debian Stretch backports. I've used 4.12, 4.13, 4.14, 4.15, and 4.16 prior. All of them unstable (although once upon a time 4.13 had a long uptime) but that was after disabling C-states both in bios and in software, and disabling ASLR. Subsequent kernel versions didn't seem to make a difference even with all of that disabled. Even with 4.13 every so often I would see a VM go to 100% and have to be restarted, though much less frequent. I also had a much older BIOS at that time.

During the RMA process I moved all of the VM images back to an old 2012 Intel i3 that I had used prior. Not a single problem from that system in the week that I ran it and at times under heavy load. I was going to try a bunch of stuff like, CPU pinning with VMs etc, but I've read other folks tried that and it still crashed. I'm not going to continue trying to make this work. I'm just not going to buy AMD ever again. In fact if I must I will purchase older gen processors after doing research to ensure they can handle running VMs under Linux.

This has been a fight since November 2017 and I've lost countless hours. If anyone has suggestions I'm open to them, but at this point it seems like a lost fight and time to move on.

@infoveinx

Update. The initial crashes that I encountered with the replacement CPU were still under 4.16 kernel. The only incident I had with 4.17 kernel was starting the 5 linux VMs simultaneously and then one of them crashed not long after startup.

Something else I did was disable IOMMU in bios almost two days ago and I can't quite recall if I did this prior to the single 4.17 kernel incident. I let it sit mostly idle for a little over one day and didn't experience a crash or idle lockup. Today I tried to use every method prior to crash it and it never experienced a single hiccup. I'm not sure what to make of it yet so going to let it go longer and see what happens. Unfortunately in the past I've seen it crash anywhere from within minutes, to multiple days.

@infoveinx

Latest update. System went 7 days without issue but is now back to being completely unstable. During those 7 days I ran it through a gambit of things from normal tasks, to many simultaneous things involving network I/O, disk both SATA and USB 3.0 transfers along with some stress-ng runs. I had 6 Linux vms running on it and it never hinted a single issue. I find it on 7th day locked up with a kernel panic. Since then it is completely unstable, with or without vms running. It's very hard for me to understand how it can run so perfect and the suddenly become so unstable with no changes.

Some thoughts on this. I would assume that if the mobo were bad, the behavior would have shown well before 7 days. The same goes for the PSU. I did run the RAM through an 8 hour memtest some time back and saw no problems. The only conclusion I can come to here is that the Linux kernel itself is just not working well with this CPU for whatever reason. The other thing I wondered is whether something about the mobo is providing incorrect voltages and degrading the CPU over time in some way.

I can't really deal with this any longer, so I think for now the system will just get shelved and replaced with a previous-gen Intel. I've been reading around again and I see several folks who are essentially dealing with the same kind of conditions, i.e., stable, then completely unstable with general protection faults/segfaults, basic system lockups, etc. Yes, it does seem like many people who RMA are getting 2017 Week 33 replacements.

@jstarcher
Author

Have you tried everything outlined here: http://blog.programster.org/stabilizing-ubuntu-16-04-on-ryzen

These changes seemed to really help me. At this point the only thing I get is a soft lockup on occasion which I’m 99% sure it’s an Nvidia driver issue. I can ssh to the machine and Xorg is locked up and I have errors in the Xorg log but I haven’t seen evidence of CPU instability. Also try with completely stock ram settings, not XMP profile. XMP isn’t guaranteed to be stable. Worth upping soc voltage to 1.1v if you haven’t yet as well.

@infoveinx

Yeah, thanks. I've tried all of those things and then some. The longest run I had was on kernel 4.13 (I'm running Debian 9.x so I'm using backports to get newer kernels). Basically it came down to disable C-states in bios, disable the remaining states via the Zenstates script, disable ASLR, and finally blacklisting nouveau driver. I went over 100 days uptime with that, however I still had the occasional VM lockup when I would do a heavy file transfer over network on the host system itself (not a VM on said system). Since then I've updated to latest bios and iterated over kernels 4.14, 4.15, 4.16, and now on 4.17.

I have also tried running without XMP, with no change in results. With XMP enabled, the BIOS shows the RAM's rated timings, fwiw. That 100+ days of uptime was also on the BIOS that came with the mobo, which was quite old and predates the configurable power supply idle setting that AMD added. I haven't had a soft lockup in a very long time; my issues all appear to be related to memory now. I also tried to disable SMT, Opcache, etc. In the end the only way to keep it working was to pass maxcpus=1 to the kernel so that only a single CPU was used. In that case I was able to copy files across the network (Samba) and back and forth to a USB 3.0 drive (as backup) with no crashes.
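For anyone repeating the maxcpus=1 experiment, here is a minimal Python sketch for confirming that the limit actually made it onto the running kernel's command line. It assumes a Linux system where /proc/cmdline is readable; the function names are made up for illustration.

```python
def parse_maxcpus(cmdline):
    """Return the integer value of maxcpus= on a kernel command line, or None if absent."""
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == "maxcpus" and sep and value.isdigit():
            return int(value)
    return None


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with (Linux only)."""
    with open(path) as fh:
        return fh.read()
```

On the affected machine, `parse_maxcpus(read_cmdline())` should return 1 if the parameter was picked up by the bootloader, and None if it was not passed at all.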

I'm not running Xorg or any GUI on this system, but I do have a Geforce 210 as the video card. I've yet to see kernel errors related to video. I tried adding voltage slowly and that seemed to increase problems (on the returned CPU), but I am willing to try it again with new CPU.

For clarity here are details of my system.

Gigabyte AB350-Gaming 3 Bios F23d
CORSAIR CX-M Series CX550M 550W PSU
Ryzen 7 1700
G.SKILL Flare X Series 32GB (4 x 8GB) 288-Pin DDR4 SDRAM DDR4 2400
Geforce 210 video card
Intel EXPI9301CTBLK Network Adapter 10/100/1000Mbps PCI-Express
SAMSUNG 850 PRO 512GB SSD
HGST Deskstar NAS 3.5" 10TB x2 running in Raid 1
Debian 9.5 kernel 4.17 running on the Samsung SSD

@infoveinx

Also noticing these in boot log. Don't recall seeing them prior. I know there has been talk of fixes related to this in newer kernels. No idea how it relates but as stated prior I'm already on 4.17 kernel.

[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
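
Since the same message repeats many times in the boot log, a short Python sketch like the following can collapse it into a count per distinct message. This is only an illustration: the function name is hypothetical, and the pattern assumes lines shaped like the ones pasted above, optionally with a dmesg-style timestamp in front.

```python
import re
from collections import Counter

# Matches kernel lines such as
#   [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
# A leading "[    0.123456] " timestamp is tolerated, and the opening bracket
# of "[Firmware Bug]" is optional because pasted logs sometimes lose it.
FW_BUG = re.compile(r"^(?:\[\s*\d+\.\d+\]\s*)?\[?Firmware Bug\]:\s*(?P<msg>.+?)\s*$")


def count_firmware_bugs(lines):
    """Return a Counter mapping each distinct firmware-bug message to its count."""
    counts = Counter()
    for line in lines:
        match = FW_BUG.match(line)
        if match:
            counts[match.group("msg")] += 1
    return counts
```

Feeding it the lines of a saved boot log (for example, the output of dmesg written to a file) gives one entry per distinct message instead of a wall of repeats.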

@infoveinx

Did the following.

  • Reset bios to optimized defaults
  • Turned on SVM
  • Set PSU to typical current
  • Made sure XMP Profile disabled
  • Set VCORE SOC to 1.116 (fluctuates from 1.104 - 1.128)
  • Disabled IOMMU
  • CSM mode with UEFI for boot devices

System is still randomly unstable. I had a hard-locked CPU related to KVM, and random segfaults may or may not happen after each reboot when I run a command that I know generates them sometimes. It still boggles my mind that it went 7 days with no issue running all kinds of tasks.

No vms running, got these when I went to copy the qcow2 images to the Raid 1 to start backup and rebuild process with another system.

page:fffff89a1dd21880 count:0 mapcount:0 mapping:0000000000000f00 index:0x1

I rebooted, set SOC back to Auto, and then copied the 67G worth of qcow2 images to the Raid 1 with no problem. Maybe there is just some kind of voltage regulator problem here, I'm not sure. I'll mess with it on the side while I have the hopefully stable replacement up.

I put a load of 21 on it last night via converting some h.265 to h.264 video with ffmpeg, all while running other things in a loop to try to break it, and of course it had zero problems. Running VMs though is a matter of time (and much shorter time lately).

@m-r-s

m-r-s commented Aug 15, 2018

random segfaults may or may not happen after each reboot while trying to run a command that I know generates them sometimes

This sounds familiar...
My advice: "If it does not run stable with stock settings, save the time and RMA it."

@infoveinx

This sounds familiar...
My advice: "If it does not run stable with stock settings, save the time and RMA it."

Unfortunately this is with the RMA CPU. I sent in a 1733PGS and got a 1733SUS back.
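
For comparing batch codes like these, here is a small Python sketch that splits a code into year, week, and suffix. It assumes the first two digits are the year and the next two the week, which is how the thread reads 1733 as "2017 Week 33"; that reading is an assumption, not something AMD documents here, and the names are made up for illustration.

```python
import re
from typing import NamedTuple


class DateCode(NamedTuple):
    year: int
    week: int
    suffix: str


def parse_date_code(code):
    """Split a code like '1733PGS' into (year, week, suffix), assuming a YYWW prefix."""
    match = re.fullmatch(r"(\d{2})(\d{2})([A-Z]{3})", code.strip().upper())
    if match is None:
        raise ValueError(f"unrecognised date code: {code!r}")
    year = 2000 + int(match.group(1))
    week = int(match.group(2))
    if not 1 <= week <= 53:
        raise ValueError(f"week out of range in date code: {code!r}")
    return DateCode(year, week, match.group(3))
```

Under that assumption, both 1733PGS and 1733SUS decode to the same year and week and differ only in the suffix.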

@jstarcher
Author

I agree, it might be time to RMA the motherboard and/or RAM. Maybe try one stick of RAM at a time to isolate whether you have a bad stick.

@v0idwalker

There is no official information on which CPUs are affected (or not affected).
Your description here does fit the ryzen segfault bug.
Prime95 and Memtest86 are not (as) sensitive to the bug as this workload.
If you hit a segfault that rapidly (less than 5 minutes) and only one or a few processes fail (and some continue running), then your CPU is probably affected.
If all processes fail within a short period it may be due to another problem.

You can still check the build logs in /mnt/ramdisk/workdir for problems other than a faulty CPU.
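
The rule of thumb quoted above can be written down as a rough Python sketch. The 300 second cutoff comes from the "less than 5 minutes" in the quote; treating any partial failure as "one or a few processes" is an assumption, and the function name and return strings are made up for illustration. It is not a diagnostic tool.

```python
def classify_run(failed, total, seconds_to_first_failure, rapid_threshold=300):
    """Apply the quoted rule of thumb to one test run.

    failed: number of parallel build processes that failed
    total: number of parallel build processes that were started
    seconds_to_first_failure: run time until the first failure
    rapid_threshold: "less than 5 minutes" from the quote, in seconds
    """
    if total <= 0 or not 0 <= failed <= total:
        raise ValueError("failed must be between 0 and total, and total must be positive")
    if failed == 0:
        return "no failure observed"
    if failed == total:
        # All processes failing points away from the segfault bug.
        return "possibly another problem"
    if seconds_to_first_failure < rapid_threshold:
        # Some processes failed quickly while others kept running.
        return "CPU probably affected"
    return "inconclusive"
```

For example, a run where 1 of 16 processes fails after 200 seconds falls into the "CPU probably affected" case, while a run where all 16 fail does not.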

Hello, just a quick question.
Can memtest86 be influenced by this bug? Can memtest86 trigger this bug and make the RAM look faulty?

@m-r-s

m-r-s commented Nov 8, 2018

Theoretically, yes.
But one of the particular observations was that memtest86 ran fine on the faulty CPUs while the compilation of GCC failed.

@jstarcher
Author

jstarcher commented Nov 8, 2018

Same experience here. Memtest ran overnight without finding any errors. This bug requires heavy CPU usage across all threads to trigger, which memtest doesn’t do.

That isn’t to say it is impossible for the bug to cause memtest to fail, though.

@v0idwalker

Well, I am sending my 1700X in for RMA. Meanwhile I borrowed a 2600 and will check if the problem persists.
Was this bug observed on Zen+ too? (I have an ASRock Taichi X470, so there should be no problem with compatibility.)

Also, what is the expected final step of this script?

@doug65536

doug65536 commented Mar 5, 2023

If you have this problem today, you can disable the uop cache in AMD CBS settings in your BIOS (UEFI settings). uop as in micro-op, as in mu-op (μop). It hardly reduces performance if you disable it, but it completely fixes this issue. The code has to be tons of huge instructions to even be able to measure a difference in performance. The instruction decoder is so good, you hardly even need the uop cache. I didn't see "uop" mentioned in this thread, hopefully this isn't redundant.
