I went to FOSDEM again this year, which is an excellent two-day conference for open source developers, run by volunteers, and is free to attend.
You can find my very rough notes from the talks I attended on the Saturday below. I also have notes from the Sunday.
If you want to watch the recorded talks, see the FOSDEM videos.
Brandon Philips (CoreOS) - Containing Infrastructure: The Internet on Kubernetes
- 3.5B Internet users
- 29k software devs/IT practitioners
- we’re outnumbered
- most people are uploading their data to servers
- our responsibility to look after it
- ~100M servers worldwide
- 3 per person
SDN/NFV room - OPNFV
https://www.opnfv.org/
Raymond Knopp - 4G/5G
- 5G coming soon
- Going to introduce a lot of complexity to cellular networks
- http://telecominfraproject.com/ (Google, FB and others)
- ITU IMT2020 FG Vision
- C-RAN is where compute is
Bert Vermeulen - Switchdev kernel subsystem
https://fosdem.org/2017/schedule/event/switchdev/
- What is an Ethernet switch? Forwards ethernet frames
- FDB forwarding database (learning, aging)
- https://en.wikipedia.org/wiki/Forwarding_information_base
- needs a CPU; each received frame used to cause an interrupt
- OpenWRT swconfig (small CPUs, builtin switch chip)
- DSA switch chip with mainline kernel supports 5 ports; bridging offloaded to hardware
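The learning and aging behaviour of an FDB can be sketched in a few lines. This is only a toy model to illustrate the idea, not how switchdev or any switch chip implements it; the class name, method names and the 300-second ageing time are my own assumptions.

```python
import time


class ForwardingDatabase:
    """Toy FDB: maps source MAC addresses to ports and ages out stale entries."""

    def __init__(self, ageing_seconds=300.0, clock=time.monotonic):
        self.ageing_seconds = ageing_seconds  # assumed value, not from the talk
        self.clock = clock
        self.entries = {}  # mac -> (port, last_seen)

    def learn(self, src_mac, in_port):
        # Learning: remember which port a source address was last seen on.
        self.entries[src_mac] = (in_port, self.clock())

    def lookup(self, dst_mac):
        # Return the learned port, or None if unknown or aged out.
        entry = self.entries.get(dst_mac)
        if entry is None:
            return None
        port, last_seen = entry
        if self.clock() - last_seen > self.ageing_seconds:
            del self.entries[dst_mac]
            return None
        return port

    def forward(self, src_mac, dst_mac, in_port, all_ports):
        # Learn the sender, then forward to one port or flood to all others.
        self.learn(src_mac, in_port)
        out_port = self.lookup(dst_mac)
        if out_port is None:
            return [p for p in all_ports if p != in_port]
        if out_port == in_port:
            return []
        return [out_port]
```

A frame for an unknown destination is flooded to every port except the one it arrived on; once the destination has been seen as a sender, frames go to that single port until the entry ages out.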
Ray Kinsella (Intel) - TLDK Overview
https://fosdem.org/2017/schedule/event/tldk/ https://fd.io/
- built on http://dpdk.org/
- On cellular network
- TCP connection establishment is as important as throughput - most TCP flows are small
- 60% of flows are smaller than 4KB, only 9.5% are larger than 32KB
- 97% of traffic not P2P on non-cellular network
- some companies are using multiple TCP stacks, driven by dev desire to take advantage of new TCP options
- run well-known kernel stack on top of netmap, OpenFastPath, DPDK
- gives broad RFC compliance
- BSD sockets API
- kernel stacks assume they’re running in the kernel - requires work to make them run in userspace
- core locality - eliminate context switching by processing packets on same core as data stream
- what is TLDK?
- high performance L4 implementation on top of DPDK
- from scratch, not using existing kernel stack
- optimised for performance
- not same level of RFC compliance
- aims to be the fastest TCP implementation on a general-purpose CPU
- cache coherency - end-to-end affinity
- avoid swapping cache lines between cores
- avoids context switching (in userland)
- support for common HW offloads
- socket-like API calls where possible
- scales linearly with physical cores
- maxes out PCI at 5 cores using UDP packets of 64 bytes (36.4 Mpps)
- TCP transparent proxy
- sends ACKs on behalf of the client
- caches TCP packets close to the network
- reduces retransmission and buffer bloat
- ray.kinsella@intel.com
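The core locality idea (keep every packet of a flow on the same core) can be illustrated with a small hash-based sketch. This is not TLDK's API; real deployments typically let the NIC spread flows across queues in hardware. The function name and the choice of CRC32 are assumptions made purely for illustration.

```python
import zlib


def core_for_flow(src_ip, src_port, dst_ip, dst_port, num_cores):
    """Pick a core for a flow so every packet of that flow lands on the same core."""
    # Sort the two endpoints so both directions of a connection hash identically.
    endpoints = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    key = repr(endpoints).encode("utf-8")
    return zlib.crc32(key) % num_cores
```

Because the mapping depends only on the flow's addresses and ports, the per-flow state never has to move between cores, which is what avoids bouncing cache lines around.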
Ferruh Yigit (Intel) - How to write a simple DPDK forwarding application
https://fosdem.org/2017/schedule/event/dpdkapp/
- DPDK designed to address the problem of packet sizes versus line rate
- small packets mean a lot of packets on interfaces with high line rates
- does not make good use of CPU caches
- needs hugepages
- need DPDK-compatible NIC drivers
- EAL: Environment Abstraction Layer
- abstracts the hardware
- scans for devices
- DPDK manages memory itself (cannot use OS memory allocation APIs)
- DPDK mempool: fixed size memory buffers
- use multiple queues for performance (process each queue on separate cores)
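To make the packet size versus line rate problem concrete, here is a rough back-of-the-envelope calculation. The 10 Gbit/s link and 3 GHz core are example figures I picked, not numbers from the talk; the 20 bytes of per-frame overhead are the standard Ethernet preamble, start-of-frame delimiter and inter-frame gap.

```python
# Per-frame overhead on the wire that is not part of the frame itself:
# 7-byte preamble + 1-byte start-of-frame delimiter + 12-byte inter-frame gap.
ETHERNET_OVERHEAD_BYTES = 20


def packets_per_second(line_rate_bps, frame_bytes):
    """Maximum frames per second at a given line rate and frame size."""
    bits_on_wire = (frame_bytes + ETHERNET_OVERHEAD_BYTES) * 8
    return line_rate_bps / bits_on_wire


def cycles_per_packet(cpu_hz, line_rate_bps, frame_bytes):
    """CPU cycles available per packet if a single core must keep up with line rate."""
    return cpu_hz / packets_per_second(line_rate_bps, frame_bytes)


if __name__ == "__main__":
    for size in (64, 512, 1518):
        pps = packets_per_second(10e9, size)
        budget = cycles_per_packet(3e9, 10e9, size)
        print(f"{size:5d} bytes: {pps / 1e6:6.2f} Mpps, {budget:7.1f} cycles/packet")
```

At 10 Gbit/s, minimum-size 64-byte frames work out to roughly 14.88 million packets per second, which leaves a 3 GHz core only about 200 cycles per packet.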
Klaus Aehlig (Google) - Bazel
- build tool (like Make); derives files from source files
- core of Google’s build tooling for over a decade (Google uses extensions)
- optimised for Google’s use case
- a large single repository (AKA mono-repo)
- open source since 2015
- why Bazel?
- aggressive caching while retaining correctness
- as if built freshly from source
- declarative
- separation of concerns
- write code rather than choose correct compiling strategy
- how does it work?
- load the BUILD files needed
- generates action graph based on dependencies
- execute actions if cache miss
- client-server architecture to keep graph in memory
- example
- WORKSPACE file indicates root of codebase (can be an empty file)
- BUILD file declaring dependencies
- CC, link options, host/target architecture taken care of elsewhere
- sandboxing to help detect incorrect builds
- remote execution (build cluster)
- enables shared caches
- Skylark language for extending BUILD language (Python-like)
- can be used to support builds for other languages
- challenges of open-sourcing
- Google-specific internal dependencies
- open source dependencies or find replacements
- focus on languages used at Google
- no stable interfaces: changing APIs is easy on a large internal code base, but not appropriate for an open-source project
- hard-coded paths everywhere
- process of open-sourcing still ongoing
- Bazel 1.0 (‘properly’ open source - expected 2018)
- public repository will be primary repository
- clear interfaces between Bazel and Google’s use
- all design reviews in public
- core team currently all at Google
- working towards stable build language and APIs
- remote execution API being improved upon
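The "execute actions if cache miss" step is easier to picture with a toy example. This is my own minimal sketch of a content-addressed action cache, not Bazel's implementation; the class, the key format and the fake compile step are all hypothetical.

```python
import hashlib


class ActionCache:
    """Toy content-addressed cache: an action reruns only when its inputs change."""

    def __init__(self):
        self.results = {}    # action key -> output
        self.executions = 0  # how many times an action really ran

    @staticmethod
    def action_key(command, inputs):
        # The key covers the command and the content of every input, so a
        # change to either yields a different key and therefore a cache miss.
        digest = hashlib.sha256(command.encode("utf-8"))
        for name in sorted(inputs):
            digest.update(name.encode("utf-8"))
            digest.update(hashlib.sha256(inputs[name]).digest())
        return digest.hexdigest()

    def run(self, command, inputs, execute):
        key = self.action_key(command, inputs)
        if key not in self.results:
            self.executions += 1
            self.results[key] = execute(inputs)
        return self.results[key]


if __name__ == "__main__":
    cache = ActionCache()
    fake_compile = lambda inputs: b"obj:" + inputs["main.c"]
    cache.run("cc main.c", {"main.c": b"int main(){}"}, fake_compile)
    cache.run("cc main.c", {"main.c": b"int main(){}"}, fake_compile)
    print(cache.executions)  # the second call is served from the cache
```

Keying on the content of the inputs rather than on timestamps is what lets a cached result be treated as if it had been built freshly from source, and it is also what makes a cache shareable between machines.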
Andrea Arcangeli (Red Hat, Inc) - 20 years of Linux Virtual Memory
- what are virtual pages?
- map virtual memory to physical
- virtual pages cost ‘nothing’
- practically unlimited on 64bit architectures
- on x86, page tables are implemented as a radix tree
- Total: grep PageTables /proc/meminfo
- Page tables are 4KB in size
- ‘fabric’ of virtual memory = all data structures that abstract the hardware:
- tasks
- processes
- virtual memory areas
- mmap
- algorithms for these data structures are the virtual memory heuristics
- no perfect solution
- when’s the right time to unmap pages (swappiness?)
- all free memory is used as a cache
- we overcommit by default (though not excessively by default)
- page struct size = 64 bytes (one per page)
- mm_struct AKA MM
- describes memory of the process
- shared by threads
- vm_area_struct AKA VMA
- Virtual Memory Area
- created and torn down using mmap and munmap
- page reclaim clock algorithm
- reclaiming required scanning all pages to look for a candidate to free
- clock algorithm replaced with Least Recently Used (LRU) list
- two LRUs
- active and inactive
- introduced in 2001
- common use case: backing up requires streaming I/O
- grep -i active /proc/meminfo
- rmap obsoleted the pgtable scan clock algorithm
- rmap = reverse mapping
- provides a way to map all of the pagetables of any given physical page without having to scan them all
- there are many other LRUs (e.g. every memory cgroup has its own LRU)
- separate LRU for anonymous and file-backed mappings
- numactl (automatic NUMA balancing)
- ‘transparent hugepages’
- https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/s-memory-transhuge.html
- virtual memory in hypervisors
- how can we take advantage of virtual memory in hypervisors?
- KVM philosophy
- reuse Linux code as much as possible
- kernel module + mmu/sched notifier
- numa balancing: /proc/sys/kernel/numa_balancing
- hugepages:
- larger than 4KB pages
- 512 times bigger -> 2MiB
- benefits?
- enlarge TLB size (essential for KVM)
- speed up TLB misses (essential for KVM)
- faster to allocate memory initially (minor)
- transparent hugepages (Red Hat specific?) hide some of the complexity of hugepages from the system administrator
- memory externalisation: use memory on a remote node
- http://www.orbitproject.eu/tutorials-documentation-how-to-use-the-results-of-this-projects/
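The active/inactive LRU idea from the page reclaim notes can be sketched as follows. This is a heavily simplified model rather than the kernel's algorithm; the class name and the rule capping the active list at half of the capacity are assumptions of mine.

```python
from collections import OrderedDict


class TwoListLRU:
    """Toy page cache with an inactive and an active LRU list (oldest entry first)."""

    def __init__(self, capacity):
        assert capacity >= 2
        self.capacity = capacity
        self.max_active = capacity // 2  # assumed balance rule, for illustration
        self.inactive = OrderedDict()
        self.active = OrderedDict()

    def access(self, page):
        if page in self.active:
            # Already hot: refresh its position in the active list.
            self.active.move_to_end(page)
        elif page in self.inactive:
            # Second touch: promote to the active list.
            del self.inactive[page]
            self.active[page] = True
            if len(self.active) > self.max_active:
                # Keep the active list bounded by demoting its oldest page.
                demoted, _ = self.active.popitem(last=False)
                self.inactive[demoted] = True
        else:
            # First touch: new pages always start on the inactive list.
            self.inactive[page] = True
            if len(self.inactive) + len(self.active) > self.capacity:
                # Reclaim the coldest inactive page.
                self.inactive.popitem(last=False)
```

Pages touched only once (such as the streaming I/O of a backup) pass through the inactive list and are reclaimed from there, so they never push out pages that were promoted to the active list by repeated use.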
Misc notes
https://en.wikipedia.org/wiki/Multipath_TCP