Kernel bug causes ethernet driver to stop working

One of my servers keeps becoming unreachable because of a kernel bug in the e1000e Ethernet driver. I tried all the kernel versions listed below, but none of them fixed the issue.

What can I do to resolve this issue?


Ubuntu Version: Ubuntu 16.04.3 LTS

Kernel Versions:

  • 4.13.0
  • 4.14.17
  • 4.15.2
  • 4.15.3

NIC:

00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-LM (rev 31)
        Subsystem: Fujitsu Technology Solutions Ethernet Connection (2) I219-LM
        Kernel driver in use: e1000e
        Kernel modules: e1000e

syslog:

Feb 16 09:26:19 foxtrot kernel: [ 6315.103309] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Feb 16 09:26:46 foxtrot kernel: [ 6341.860523] e1000e 0000:00:1f.6 eth0: Reset adapter unexpectedly
Feb 16 09:26:46 foxtrot kernel: [ 6341.880459] ------------[ cut here ]------------
Feb 16 09:26:46 foxtrot kernel: [ 6341.880461] kernel BUG at /home/kernel/COD/linux/drivers/net/ethernet/intel/e1000e/netdev.c:3836!
Feb 16 09:26:46 foxtrot kernel: [ 6341.880609] invalid opcode: 0000 [#1] SMP PTI
Feb 16 09:26:46 foxtrot kernel: [ 6341.880702] Modules linked in: ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_nat_ipv4 xt_addrtype nf_nat br_netfilter bridge stp llc xt_tcpudp overlay nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables x_tables intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd glue_helper cryptd intel_cstate intel_rapl_perf serio_raw intel_pch_thermal mac_hid acpi_pad autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear raid1 e1000e psmouse ptp ahci pps_core libahci wmi video
Feb 16 09:26:46 foxtrot kernel: [ 6341.881046] CPU: 7 PID: 72 Comm: kworker/7:1 Tainted: G        W        4.15.3-041503-generic #201802120730
Feb 16 09:26:46 foxtrot kernel: [ 6341.881156] Hardware name: FUJITSU  /D3401-H2, BIOS V5.0.0.12 R1.8.0 for D3401-H2x                     05/15/2017
Feb 16 09:26:46 foxtrot kernel: [ 6341.881275] Workqueue: events e1000_reset_task [e1000e]
Feb 16 09:26:46 foxtrot kernel: [ 6341.881373] RIP: 0010:e1000_flush_desc_rings+0x2cb/0x2e0 [e1000e]
Feb 16 09:26:46 foxtrot kernel: [ 6341.881465] RSP: 0018:ffff9ff6033f3d88 EFLAGS: 00010202
Feb 16 09:26:46 foxtrot kernel: [ 6341.881555] RAX: 00000000000000d3 RBX: ffff8f0d2ee048c0 RCX: 00000000000000e9
Feb 16 09:26:46 foxtrot kernel: [ 6341.881648] RDX: 00000000000000d3 RSI: 0000000000000246 RDI: 0000000000000246
Feb 16 09:26:46 foxtrot kernel: [ 6341.881742] RBP: ffff9ff6033f3dc0 R08: 0000000000000002 R09: ffff9ff6033f3d54
Feb 16 09:26:46 foxtrot kernel: [ 6341.881835] R10: 00000000000000fe R11: 0000000000000000 R12: 000000003103f0fa
Feb 16 09:26:46 foxtrot kernel: [ 6341.881946] R13: ffff8f0d2ee04d78 R14: ffff8f0d39ca9480 R15: 0000000004008000
Feb 16 09:26:46 foxtrot kernel: [ 6341.882071] FS:  0000000000000000(0000) GS:ffff8f0d5e5c0000(0000) knlGS:0000000000000000
Feb 16 09:26:46 foxtrot kernel: [ 6341.882263] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 16 09:26:46 foxtrot kernel: [ 6341.882387] CR2: 00007fd08b9f7fd7 CR3: 0000000700a0a001 CR4: 00000000003606e0
Feb 16 09:26:46 foxtrot kernel: [ 6341.882481] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Feb 16 09:26:46 foxtrot kernel: [ 6341.882661] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Feb 16 09:26:46 foxtrot kernel: [ 6341.882787] Call Trace:
Feb 16 09:26:46 foxtrot kernel: [ 6341.882878]  e1000e_reset+0x516/0x760 [e1000e]
Feb 16 09:26:46 foxtrot kernel: [ 6341.882968]  e1000e_down+0x1db/0x210 [e1000e]
Feb 16 09:26:46 foxtrot kernel: [ 6341.883064]  e1000e_reinit_locked+0x4c/0x70 [e1000e]
Feb 16 09:26:46 foxtrot kernel: [ 6341.883156]  e1000_reset_task+0x59/0x60 [e1000e]
Feb 16 09:26:46 foxtrot kernel: [ 6341.883250]  process_one_work+0x1ef/0x410
Feb 16 09:26:46 foxtrot kernel: [ 6341.883338]  worker_thread+0x32/0x410
Feb 16 09:26:46 foxtrot kernel: [ 6341.883419]  kthread+0x121/0x140
Feb 16 09:26:46 foxtrot kernel: [ 6341.883506]  ? process_one_work+0x410/0x410
Feb 16 09:26:46 foxtrot kernel: [ 6341.883594]  ? kthread_create_worker_on_cpu+0x70/0x70
Feb 16 09:26:46 foxtrot kernel: [ 6341.883685]  ret_from_fork+0x35/0x40
Feb 16 09:26:46 foxtrot kernel: [ 6341.883772] Code: e8 fb fc ff ff eb d6 4c 89 ef e8 f1 fc ff ff eb 95 4c 89 ef e8 e7 fc ff ff e9 66 ff ff ff 4c 89 ef e8 da fc ff ff e9 02 ff ff ff <0f> 0b e8 5e fb 13 d8 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 
Feb 16 09:26:46 foxtrot kernel: [ 6341.883949] RIP: e1000_flush_desc_rings+0x2cb/0x2e0 [e1000e] RSP: ffff9ff6033f3d88
Feb 16 09:26:46 foxtrot kernel: [ 6341.884056] ---[ end trace abbf45ab36b73ab9 ]---
Feb 16 09:28:38 foxtrot autossh[1513]: ssh exited with error status 255; restarting ssh
Feb 16 09:28:38 foxtrot autossh[1513]: starting ssh (count 2)
Feb 16 09:28:38 foxtrot autossh[1513]: ssh child pid is 20383
Feb 16 09:28:40 foxtrot autossh[1513]: ssh exited with error status 255; restarting ssh
Feb 16 09:28:40 foxtrot autossh[1513]: starting ssh (count 3)

Solution 1

I was able to resolve this issue by disabling TSO, GSO and GRO with the following command. The setting does not persist across reboots, so the command has to be run again after each reboot; adding it to rc.local is one way to do that.

ethtool -K eth0 gso off gro off tso off

It has been over six months now and the issue has not occurred again since I disabled these offloads.
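For anyone who wants to run this from rc.local, the command above could be wrapped in a small script along the following lines. This is only a sketch: the ETHTOOL override variable, the function names and the script path in the comment are made up for illustration, and the interface name has to match your system. It runs the same ethtool -K command and then prints the three relevant lines of ethtool -k so you can check that the offloads are reported as off.

```shell
#!/bin/sh
# Sketch: disable TSO, GSO and GRO on one interface and show the result.
# ETHTOOL, the function names and the script path below are assumptions
# made for this example; they are not part of the original answer.

ETHTOOL="${ETHTOOL:-ethtool}"   # override for a dry run, e.g. ETHTOOL=echo

disable_offloads() {
    iface="$1"
    # Same flags as the command in the answer; -K changes offload settings.
    "$ETHTOOL" -K "$iface" gso off gro off tso off
}

show_offloads() {
    iface="$1"
    # -k (lowercase) lists the current settings; keep the three of interest.
    "$ETHTOOL" -k "$iface" |
        grep -E '^(tcp-segmentation-offload|generic-segmentation-offload|generic-receive-offload):'
}

# Only act when an interface name is passed, for example from rc.local:
#   /usr/local/sbin/disable-offloads.sh eth0   (hypothetical path)
if [ "$#" -ge 1 ]; then
    disable_offloads "$1" && show_offloads "$1"
fi
```

Running it once by hand with ETHTOOL=echo prints the command that would be executed instead of changing the NIC, which is a cheap way to check the script before putting it in rc.local.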

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
