Changelog History
-
v1.11.0 Changes
This release contains the following highlights:
- Mesos Containerizer now supports using pre-provisioned external CSI storage volumes by means of the new volume/csi isolator; the latter significantly extends the range of compatible 3rd party CSI plugins compared to the already existing SLRP-based solution (MESOS-10141).
- The Scheduler API adds an interface allowing frameworks to put constraints on agent attributes in resource offers to help "picky" frameworks significantly reduce scheduling latency when close to being out of quota (MESOS-10161).
- The CMake build becomes usable for deploying in production (MESOS-898).
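As a rough illustration of the idea behind constraint-based offer filtering (MESOS-10161), the sketch below filters a list of agents by attribute constraints before any offer would be built. It is not the Mesos Scheduler API: the Constraint class, the predicate names and the agent dictionaries are hypothetical, and only mirror the constraint kinds named in the resolved issues for this release (Exists/NotExists, value equality, regex match). Python's re module stands in for RE2 here.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Constraint:
    """Hypothetical attribute constraint; not a Mesos protobuf message."""
    attribute: str
    predicate: str  # one of: "exists", "not_exists", "equals", "matches"
    value: Optional[str] = None

    def accepts(self, attributes: Dict[str, str]) -> bool:
        present = self.attribute in attributes
        if self.predicate == "exists":
            return present
        if self.predicate == "not_exists":
            return not present
        if not present:
            # Assumption of this sketch: value-based predicates reject
            # agents that do not carry the attribute at all.
            return False
        if self.predicate == "equals":
            return attributes[self.attribute] == self.value
        if self.predicate == "matches":
            return re.fullmatch(self.value, attributes[self.attribute]) is not None
        raise ValueError(f"unknown predicate: {self.predicate}")


def filter_agents(agents: List[dict], constraints: List[Constraint]) -> List[dict]:
    """Return only the agents whose attributes satisfy every constraint."""
    return [
        agent for agent in agents
        if all(c.accepts(agent["attributes"]) for c in constraints)
    ]


# Made-up agents and attributes, for illustration only.
agents = [
    {"id": "agent-1", "attributes": {"zone": "us-east-1a", "gpu": "true"}},
    {"id": "agent-2", "attributes": {"zone": "us-west-2b"}},
    {"id": "agent-3", "attributes": {"zone": "us-east-1c"}},
]
constraints = [
    Constraint("zone", "matches", r"us-east-.*"),
    Constraint("gpu", "not_exists"),
]
selected = [agent["id"] for agent in filter_agents(agents, constraints)]
print(selected)
```

Filtering before offers are generated is what saves a "picky" framework from receiving and declining offers it could never use.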
Additional API Changes:
- Breaking change: Deprecated authentication credential text format support.
Unresolved Critical Issues:
- [MESOS-10194] - Mesos master failure "Check failed: 'get_(role)' Must be SOME"
- [MESOS-10186] - Segmentation fault while running mesos in SSL mode
- [MESOS-10146] - Removing task from slave when framework is disconnected causes master to crash
- [MESOS-10066] - mesos-docker-executor process dies when agent stops. Recovery fails when agent returns
- [MESOS-10011] - Operation feedback with stale agent ID crashes the master
- [MESOS-9967] - Authorization header is missing when using a default registry
- [MESOS-9579] - ExecutorHttpApiTest.HeartbeatCalls is flaky.
- [MESOS-9536] - Nested container launched with non-root user may not be able to write to its sandbox via the environment variable MESOS_SANDBOX
- [MESOS-9500] - spark submit with docker image on mesos cluster fails.
- [MESOS-9426] - ZK master detection can become forever pending.
- [MESOS-9393] - Fetcher crashes extracting archives with non-ASCII filenames.
- [MESOS-9365] - Windows - GET_CONTAINERS API call causes the Mesos agent to fail
- [MESOS-9355] - Persistence volume does not unmount correctly with wrong artifact URI
- [MESOS-9352] - Data in persistent volume deleted accidentally when using Docker container and Persistent volume
- [MESOS-9053] - Network ports isolator can falsely trigger while destroying containers.
- [MESOS-9006] - The agent's GET_AGENT leaks resource information when using authorization
- [MESOS-8840] - cpu.cfs_quota_us may be accidentally set for command task using docker during agent recovery.
- [MESOS-8803] - Libprocess deadlocks in a test.
- [MESOS-8679] - If the first KILL stuck in the default executor, all other KILLs will be ignored.
- [MESOS-8608] - RmdirContinueOnErrorTest.RemoveWithContinueOnError fails.
- [MESOS-8257] - Unified Containerizer "leaks" a target container mount path to the host FS when the target resolves to an absolute path
- [MESOS-8256] - Libprocess can silently deadlock due to worker thread exhaustion.
- [MESOS-8096] - Enqueueing events in MockHTTPScheduler can lead to segfaults.
- [MESOS-8038] - Launching GPU task sporadically fails.
- [MESOS-7971] - PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove test is flaky
- [MESOS-7911] - Non-checkpointing framework's tasks should not be marked LOST when agent disconnects.
- [MESOS-7748] - Slow subscribers of streaming APIs can lead to Mesos OOMing.
- [MESOS-7721] - Master's agent removal rate limit also applies to agent unreachability.
- [MESOS-7566] - Master crash due to failed check in DRFSorter::remove
- [MESOS-7386] - Executor not cleaning up existing running docker containers if external logrotate/logger processes die/killed
- [MESOS-6285] - Agents may OOM during recovery if there are too many tasks or executors
- [MESOS-5989] - Libevent SSL Socket downgrade code accesses uninitialized memory / assumes single peek is sufficient.
All Resolved Issues:
** Bug
- [MESOS-7485] - Add verbose logging for curl commands used in fetcher/puller
- [MESOS-7834] - CMake does not set default --launcher_dir correctly
- [MESOS-9609] - Master check failure when marking agent unreachable.
- [MESOS-10126] - Docker volume isolator needs to clean up the info struct regardless the result of unmount operation
- [MESOS-10134] - Race between concurrent javah runs trying to create java/jni output directory.
- [MESOS-10137] - Mesos failed to build due to error C2668 on windows with MSVC
- [MESOS-10169] - Reintroduce image fetch deduplication while keeping it possible to destroy UCR containers in PROVISIONING state.
- [MESOS-10192] - Recent Nvidia CUDA changes break Mesos GPU support
** Epic
- [MESOS-898] - Introduce CMake as an alternative build system.
- [MESOS-10141] - CSI External Volume Support
- [MESOS-10161] - Constraints-based offer filtering
** Improvement
- [MESOS-6692] - Install module dependencies during build
- [MESOS-6771] - Add and vet install target
** Task
- [MESOS-10142] - CSI External Volumes MVP Design Doc
- [MESOS-10147] - Introduce a new volume type CSI into the Volume protobuf message
- [MESOS-10148] - Update the CSIPluginInfo protobuf message for supporting 3rd party CSI plugins
- [MESOS-10149] - Improve CSI service manager to support unmanaged CSI plugins
- [MESOS-10150] - Refactor CSI volume manager to support pre-provisioned CSI volumes
- [MESOS-10151] - Introduce a new agent flag --csi_plugin_config_dir
- [MESOS-10152] - Implement the create method of the volume/csi isolator
- [MESOS-10153] - Implement the prepare method of the volume/csi isolator
- [MESOS-10154] - Implement the cleanup method of the volume/csi isolator
- [MESOS-10155] - Implement the recover method of the volume/csi isolator
- [MESOS-10156] - Enable the volume/csi isolator in UCR
- [MESOS-10157] - Add documentation for the volume/csi isolator
- [MESOS-10162] - Constraints-based offer filtering design doc
- [MESOS-10163] - Implement a new component to launch CSI plugins as standalone containers and make CSI gRPC calls
- [MESOS-10166] - Avoid sending framework updates to agents and subscribers when frameworkInfo/pid didn't change.
- [MESOS-10168] - Add secrets support to the CSI volume managers
- [MESOS-10170] - Bundle RE2 into Mesos
- [MESOS-10171] - Groundwork for constraints-based filtering using Exists/NotExists attribute constraint as an example.
- [MESOS-10172] - Add offer constraints on (pseudo)attribute value equality
- [MESOS-10173] - Add offer constraints on (pseudo)attribute (not) matching RE2 regex
- [MESOS-10175] - Improve CSI service manager to set node ID for managed CSI plugins
- [MESOS-10177] - Add an endpoint for offer constraints debug
- [MESOS-10179] - Expose framework's OfferConstraints via master API endpoints
- [MESOS-10189] - Pass offer constraints through the V0 scheduler driver and its Java bindings.
** Documentation
- [MESOS-10193] - Add documentation for offer constraints.
-
v1.10.1 Changes
- This is a bug fix release.
** Bug
- [MESOS-9609] - Master check failure when marking agent unreachable.
- [MESOS-10126] - Docker volume isolator needs to clean up the info struct regardless the result of unmount operation
- [MESOS-10134] - Race between concurrent javah runs trying to create java/jni output directory.
- [MESOS-10169] - Reintroduce image fetch deduplication while keeping it possible to destroy UCR containers in PROVISIONING state.
-
v1.10.0 Changes
May 28, 2020
This release contains the following highlights:
- Container resource bursting is now supported on Linux. Frameworks are now able to specify CPU and memory limits for tasks (separately from resource requests) and also the level of isolation they desire when launching task groups: CPU and memory may be isolated at the executor container level or at the task container level (MESOS-10001).
- Executors can now use a Unix domain socket to connect to an agent, instead of connecting via TCP (MESOS-10034).
- Existing reservations can now be modified via the RESERVE_RESOURCES master API call (MESOS-9981).
- Performance of read-only V1 operator API calls has been improved by introducing direct serialization into JSON/protobuf and extending the batching mechanism to parallel processing of these calls by the master (similarly to the /state endpoint). This brings V1 operator API performance on par with older HTTP endpoints (MESOS-10026, MESOS-9497).
- Breaking change for authorizer modules: authorizers are now required to implement a method for returning ObjectApprovers that are valid throughout all of their lifetime. For framework and operator API subscriber principals the set of ObjectApprovers is now requested from the authorizer only once per subscription (MESOS-10056, MESOS-10057).
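To make the request/limit distinction from the bursting highlight concrete, here is a minimal sketch of how a CPU request and a CPU limit could translate into Linux cgroup v1 CFS settings. The helper name and the mapping (1024 shares per requested CPU, quota equal to limit times period, a 100 ms period, and the rule that a limit may not be below the request) are illustrative assumptions based on common Linux conventions, not a description of what the Mesos cgroups isolator actually writes.

```python
from typing import Dict, Optional

CPU_SHARES_PER_CPU = 1024   # conventional scheduling weight for one CPU (assumption)
CFS_PERIOD_US = 100_000     # default CFS bandwidth period on Linux: 100 ms


def cfs_settings(cpu_request: float, cpu_limit: Optional[float]) -> Dict[str, int]:
    """Hypothetical helper: map a CPU request/limit pair to cgroup v1 values.

    The request becomes a relative weight (cpu.shares) that matters under
    contention; the limit becomes a hard cap (cpu.cfs_quota_us). A limit of
    None means "no cap", which cgroup v1 expresses as a quota of -1.
    """
    if cpu_request <= 0:
        raise ValueError("cpu_request must be positive")
    if cpu_limit is not None and cpu_limit < cpu_request:
        raise ValueError("cpu_limit must not be below cpu_request")

    quota = -1 if cpu_limit is None else round(cpu_limit * CFS_PERIOD_US)
    return {
        "cpu.shares": round(cpu_request * CPU_SHARES_PER_CPU),
        "cpu.cfs_period_us": CFS_PERIOD_US,
        "cpu.cfs_quota_us": quota,
    }


# A task that requests half a CPU but may burst up to two CPUs.
bursting = cfs_settings(cpu_request=0.5, cpu_limit=2.0)
# A task with a request only: weighted under contention, never capped.
uncapped = cfs_settings(cpu_request=1.0, cpu_limit=None)
print(bursting)
print(uncapped)
```

Under this scheme, the gap between the request and the limit is the headroom a task can burst into when the machine has idle CPU.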
Additional API Changes:
- Quota can now be set on the default * role.
- Quota consumption metrics are now exposed by the allocator.
Unresolved Critical Issues:
- [MESOS-10066] - mesos-docker-executor process dies when agent stops. Recovery fails when agent returns
- [MESOS-10011] - Operation feedback with stale agent ID crashes the master
- [MESOS-9967] - Authorization header is missing when using a default registry
- [MESOS-9609] - Master check failure when marking agent unreachable
- [MESOS-9579] - ExecutorHttpApiTest.HeartbeatCalls is flaky.
- [MESOS-9536] - Nested container launched with non-root user may not be able to write to its sandbox via the environment variable MESOS_SANDBOX
- [MESOS-9500] - spark submit with docker image on mesos cluster fails.
- [MESOS-9426] - ZK master detection can become forever pending.
- [MESOS-9393] - Fetcher crashes extracting archives with non-ASCII filenames.
- [MESOS-9365] - Windows - GET_CONTAINERS API call causes the Mesos agent to fail
- [MESOS-9355] - Persistence volume does not unmount correctly with wrong artifact URI
- [MESOS-9352] - Data in persistent volume deleted accidentally when using Docker container and Persistent volume
- [MESOS-9053] - Network ports isolator can falsely trigger while destroying containers.
- [MESOS-9006] - The agent's GET_AGENT leaks resource information when using authorization
- [MESOS-8840] - cpu.cfs_quota_us may be accidentally set for command task using docker during agent recovery.
- [MESOS-8803] - Libprocess deadlocks in a test.
- [MESOS-8679] - If the first KILL stuck in the default executor, all other KILLs will be ignored.
- [MESOS-8608] - RmdirContinueOnErrorTest.RemoveWithContinueOnError fails.
- [MESOS-8257] - Unified Containerizer "leaks" a target container mount path to the host FS when the target resolves to an absolute path
- [MESOS-8256] - Libprocess can silently deadlock due to worker thread exhaustion.
- [MESOS-8096] - Enqueueing events in MockHTTPScheduler can lead to segfaults.
- [MESOS-8038] - Launching GPU task sporadically fails.
- [MESOS-7971] - PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove test is flaky
- [MESOS-7911] - Non-checkpointing framework's tasks should not be marked LOST when agent disconnects.
- [MESOS-7748] - Slow subscribers of streaming APIs can lead to Mesos OOMing.
- [MESOS-7721] - Master's agent removal rate limit also applies to agent unreachability.
- [MESOS-7566] - Master crash due to failed check in DRFSorter::remove
- [MESOS-7386] - Executor not cleaning up existing running docker containers if external logrotate/logger processes die/killed
- [MESOS-6285] - Agents may OOM during recovery if there are too many tasks or executors
- [MESOS-5989] - Libevent SSL Socket downgrade code accesses uninitialized memory / assumes single peek is sufficient.
All Resolved Issues:
** Bug
- [MESOS-621] - HierarchicalAllocatorProcess::removeSlave doesn't properly handle framework allocations/resources
- [MESOS-4996] - 'containerizer->update' will always fail after killing a docker container.
- [MESOS-7217] - CgroupsIsolatorTest.ROOT_CGROUPS_CFS_EnableCfs is flaky.
- [MESOS-7639] - Oversubscription could crash the master due to CHECK failure in the allocator
- [MESOS-8537] - Default executor doesn't wait for status updates to be ack'd before shutting down
- [MESOS-8877] - Docker container's resources will be wrongly enlarged in cgroups after agent recovery
- [MESOS-9337] - Hook manager implementation is missing mutex acquisition in several places.
- [MESOS-9847] - Docker executor doesn't wait for status updates to be ack'd before shutting down.
- [MESOS-9889] - Master CPU high due to unexpected foreachkey behaviour in Master::__reregisterSlave.
- [MESOS-9958] - New CLI is not included in distribution tarball
- [MESOS-9965] - agent should not send TASK_GONE_BY_OPERATOR if the framework is not partition aware.
- [MESOS-9968] - WWWAuthenticate header parsing fails when commas are in (quoted) realm
- [MESOS-9971] - 'dist' and 'distcheck' cmake targets are implemented as shell scripts, so fail on Windows/MSVC.
- [MESOS-9975] - Sorter may leak clients allocations.
- [MESOS-9978] - Nvml isolator cannot be disabled which makes it impossible to exclude non-free code
- [MESOS-9980] - HierarchicalAllocatorTest.MaintenanceInverseOffers is flaky
- [MESOS-10007] - Command executor can miss exit status for short-lived commands due to double-reaping.
- [MESOS-10008] - Very large quota values can crash master.
- [MESOS-10015] - updateAllocation() can stall the allocator with a huge number of reservations on an agent.
- [MESOS-10018] - Duplicate tasks if agent partitioned during maintenance down
- [MESOS-10023] - Allocator method dispatches can be reordered (relative to scheduler API calls which triggered them).
- [MESOS-10041] - Libprocess SSL verification can leak memory
- [MESOS-10083] - Authorizing invalid operation can result in declined authorization.
- [MESOS-10084] - Detecting whether executor is generated for command task should work when the launcher_dir changes
- [MESOS-10090] - Mesos build on Windows appears to be broken.
- [MESOS-10092] - Cannot pull image from docker registry which does not reply with 'scope'/'service' in WWW-Authenticate header
- [MESOS-10094] - Master's agent draining VLOG prints incorrect task counts.
- [MESOS-10096] - Reactivating a draining agent leaves the agent in draining state.
- [MESOS-10097] - After HTTP framework disconnects, heartbeater idle-loops instead of being deleted.
- [MESOS-10098] - Mesos agent fails to start on outdated systemd.
- [MESOS-10100] - Recently introduced PathTest.Relative and PathTest.PathIteration fail on windows.
- [MESOS-10102] - MasterAPITest.ReservationUpdate is flaky
- [MESOS-10103] - MSVC build can segfault when composing authorization Action for updating reservation.
- [MESOS-10107] - containeriser: failed to remove cgroup - EBUSY
- [MESOS-10109] - After failover, master crashes on re-adding an agent with maintenance schedule set.
- [MESOS-10110] - Libprocess ignores most protobuf (de)serialisation failure cases.
- [MESOS-10111] - Failed check in libevent_ssl_socket.cpp: 'self->bev' Must be non NULL
- [MESOS-10113] - OpenSSLSocketImpl with 'support_downgrade' waits for incoming bytes before accepting new connection.
- [MESOS-10114] - OpenSSLSocketImpl with 'support_downgrade' can silently stop accepting sockets.
- [MESOS-10116] - Attempt to reactivate disconnected agent crashes the master
- [MESOS-10118] - Agent incorrectly handles draining when empty
- [MESOS-10120] - Authorization for /logging/toggle and /metrics/snapshot is skipped on Windows.
- [MESOS-10123] - Windows overlapped IO discard handling can drop data.
- [MESOS-10124] - OpenSSLSocketImpl on Windows with 'support_downgrade' is incorrectly polling for read readiness.
- [MESOS-10125] - Web UI roles tree files are missing from automake install.
- [MESOS-10128] - Performance regression in HierarchicalAllocations_BENCHMARK_Test.PersistentVolumes
** Epic
- [MESOS-9981] - Introduce a Mesos API to update reservations
- [MESOS-10001] - Resource Limits and Requests
- [MESOS-10034] - Agent/executor domain socket communication
** Improvement
- [MESOS-7245] - Add a Windows segfault handler for stacktraces
- [MESOS-9123] - Expose quota consumption metrics.
- [MESOS-9497] - Parallel reads for expensive master v1 read-only calls.
- [MESOS-9914] - Refactor MesosTest::StartSlave in favour of builder style interface
- [MESOS-9948] - master::Slave::hasExecutor occupies 37% of a 150 second perf sample.
- [MESOS-9964] - Support destroying UCR containers in provisioning state
- [MESOS-9972] - Update Names for TLS-related environment variables in libprocess.
- [MESOS-10016] - Add a benchmark for HierarchicalAllocatorProcess::updateAllocation()
- [MESOS-10017] - Log all reverse DNS lookup failures in 'legacy' TLS (SSL) hostname validation scheme.
- [MESOS-10026] - Improve v1 operator API read performance.
- [MESOS-10056] - Perform synchronous authorization for scheduler calls.
- [MESOS-10057] - Perform synchronous authorization for outgoing events on event stream.
- [MESOS-10095] - Agent draining logging makes it hard to tell which tasks did not terminate.
- [MESOS-10112] - Log peer address during TLS handshake failures.
** Wish
- [MESOS-9630] - Consider moving linter setup to pre-commit
** Task
- [MESOS-3938] - Consider allowing setting quotas for the default '*' role.
- [MESOS-6084] - Deprecate and remove the included MPI framework
- [MESOS-8503] - Improve UI when displaying frameworks with many roles.
- [MESOS-9843] - Implement tests for the containerizer/debug endpoint.
- [MESOS-9949] - Track allocated/offered in the allocator's role tree.
- [MESOS-9974] - Remove support/mesos-style.py transition script
- [MESOS-9982] - Add a 'source' field to operator API ReserveResources protobuf
- [MESOS-9983] - Intermediate rejection of Reserve operations with source set
- [MESOS-9984] - Provide a function to compute a common "reservation ancestor" between two 'Resources'
- [MESOS-9985] - Update validation of 'ReserveResources' for 'source'
- [MESOS-9986] - Update 'getConsumedResources' and 'getResourceConversions' for 'source' in reservations
- [MESOS-9987] - Update 'Master::Http::_reserve' to also require 'source' resources
- [MESOS-9988] - Add 'source' field to scheduler reservation API
- [MESOS-9989] - Update 'Master::Http::_reserve' to pass 'source' into generated operation
- [MESOS-9990] - Consolidate 'Master::authorizeReserveResources' overloads
- [MESOS-9991] - Update 'Master::authorizeReserveResources' for re-reservations
- [MESOS-9992] - Add end-to-end test excercising re-reservation operator API
- [MESOS-9993] - Update operator API documentation for re-reservations
- [MESOS-10002] - Design doc for container bursting
- [MESOS-10009] - Implement glue code for the Windows event loop and OpenSSL's basic I/O abstraction
- [MESOS-10010] - Implement an SSL socket for Windows, using OpenSSL directly
- [MESOS-10033] - Design per-task cgroup isolation
- [MESOS-10035] - Implement enable_http_executor_domain_sockets agent flag
- [MESOS-10036] - Implement agent code to create a domain socket on startup
- [MESOS-10037] - Create code to bind-mount domain sockets into mesos-type executor containers
- [MESOS-10038] - Implement agent code to listen on a domain socket
- [MESOS-10039] - Let the default executor connect through a domain socket when available
- [MESOS-10043] - Add resource limits into the protobuf message TaskInfo
- [MESOS-10044] - Add a new capability TASK_RESOURCE_LIMITS into Mesos agent
- [MESOS-10045] - Validate task's resources limits and the share_cgroups field
- [MESOS-10046] - Launch executor container with resource limits
- [MESOS-10047] - Update the CPU subsystem in the cgroup isolator to set container's CPU resource limits
- [MESOS-10048] - Update the memory subsystem in the cgroup isolator to set container's memory resource limits and oom_score_adj
- [MESOS-10049] - Add a new reason in TaskStatus::Reason for the case that a task is OOM-killed due to exceeding its memory request
- [MESOS-10050] - Update the update() method of containerizer to handle container resource limits
- [MESOS-10051] - Update the LaunchContainer agent API to support container resource limits
- [MESOS-10053] - Update Docker executor to set Docker container's resource limits and oom_score_adj
- [MESOS-10054] - Update Docker containerizer to set Docker container's resource limits and oom_score_adj
- [MESOS-10055] - Update Mesos UI to display the resource limits of tasks
- [MESOS-10061] - Implement chmod() support for stout
- [MESOS-10062] - Implement relative path computation for stout
- [MESOS-10063] - Update default executor to call LAUNCH_CONTAINER to launch nested containers
- [MESOS-10064] - Accommodate the "Infinity" value in JSON
- [MESOS-10065] - Update the update() method of isolator interface to handle container resource limits
- [MESOS-10067] - Update the update() method of cgroups subsystem interface to handle container resource limits
- [MESOS-10073] - Implement SSL downgrade on the native SSL socket
- [MESOS-10074] - Adapt design for executor domain sockets for agent restarts
- [MESOS-10075] - Add the shared_cgroups field into the protobuf message LinuxInfo
- [MESOS-10076] - Cgroups isolator: create nested cgroups
- [MESOS-10077] - Cgroups isolator: allow updating and isolating resources for nested cgroups
- [MESOS-10079] - Cgroups isolator: recover nested cgroups
- [MESOS-10086] - Add support for systemd socket activation for mesos domain sockets
- [MESOS-10087] - Update master & agent's HTTP endpoints for showing resource limits
- [MESOS-10115] - Add documentation for task resource limits
- [MESOS-10117] - Update the usage() method of containerizer to set resource limits in the ResourceStatistics protobuf message
** Documentation
- [MESOS-9938] - Standalone container documentation
- [MESOS-9979] - Add docs for FrameworkInfo updates and the UPDATE_FRAMEWORK call.
-
v1.10.0-rc1
May 18, 2020
-
v1.9.1 Changes
- This is a bug fix release.
** Bug
- [MESOS-9609] - Master check failure when marking agent unreachable.
- [MESOS-9964] - Support destroying UCR containers in provisioning state.
- [MESOS-9965] - Agent should not send TASK_GONE_BY_OPERATOR if the framework is not partition aware.
- [MESOS-9966] - Agent crashes when trying to destroy orphaned nested container if root container is orphaned as well.
- [MESOS-9968] - WWWAuthenticate header parsing fails when commas are in (quoted) realm
- [MESOS-9972] - Update Names for TLS-related environment variables in libprocess.
- [MESOS-10007] - Command executor can miss exit status for short-lived commands due to double-reaping.
- [MESOS-10008] - Very large quota values can crash master.
- [MESOS-10015] - updateAllocation() can stall the allocator with a huge number of reservations on an agent.
- [MESOS-10041] - Libprocess SSL verification can leak memory.
- [MESOS-10094] - Master's agent draining VLOG prints incorrect task counts.
- [MESOS-10096] - Reactivating a draining agent leaves the agent in draining state.
- [MESOS-10118] - Agent incorrectly handles draining when empty.
- [MESOS-10126] - Docker volume isolator needs to clean up the info struct regardless the result of unmount operation
- [MESOS-10134] - Race between concurrent javah runs trying to create java/jni output directory.
- [MESOS-10169] - Reintroduce image fetch deduplication while keeping it possible to destroy UCR containers in PROVISIONING state.
** Improvement
- [MESOS-9889] - Master CPU high due to unexpected foreachkey behaviour in Master::__reregisterSlave.
- [MESOS-9948] - master::Slave::hasExecutor occupies 37% of a 150 second perf sample.
- [MESOS-10017] - Log all reverse DNS lookup failures in 'legacy' TLS (SSL) hostname validation scheme.
- [MESOS-10095] - Agent draining logging makes it hard to tell which tasks did not terminate.
- [MESOS-10112] - Log peer address during TLS handshake failures.
-
v1.9.0 Changes
September 05, 2019
This release contains the following highlights:
Maintenance:
- Added new APIs to support automatic node draining via operator APIs. This serves as an alternative to framework-assisted draining using maintenance primitives. (MESOS-9753)
Resource Management:
- Support for quota limits has been added. The existing quota guarantees are deprecated in favor of using limits (and in the future, priorities).
Security:
- A new libprocess flag --hostname_validation_scheme has been added. This allows users to enable a new RFC 6125-compliant hostname verification scheme based on primitives provided by OpenSSL. This will also improve performance by getting rid of all reverse DNS lookups. (MESOS-9784)
- The use of anonymous cipher suites is now disallowed when TLS certificate verification is enabled. (MESOS-9810)
Containerization:
- A new --docker_ignore_runtime flag has been added. This causes the agent to ignore any runtime configuration present in Docker images. (MESOS-9760)
- Add no-new-privileges isolator. A new Linux isolator has been added to support enabling the no_new_privs process control flag. (MESOS-9770)
- The Mesos containerizer now masks sensitive paths in /proc for containers that do not share the host's PID namespace. (MESOS-9771)
- The Mesos containerizer now supports configurable IPC namespace and /dev/shm. A container can be configured to have a private IPC namespace and /dev/shm or share them from its parent, and the size of its private /dev/shm is also configurable. (MESOS-9795)
- The Mesos containerizer now includes ephemeral overlayfs storage in the task disk quota as well as sandbox storage. (MESOS-9900)
- A new /containerizer/debug HTTP endpoint has been added. This endpoint exposes debug information for the Mesos containerizer. At the moment, it returns a list of pending operations related to Isolators and Launchers. (MESOS-9756)
Additional API Changes:
- Mesos components will now forego TLS certificate validation for incoming connections, unless LIBPROCESS_SSL_REQUIRE_CERT is set to true.
- The Socket::connect(const Address&) member function will now abort the program when called on a LibeventSSLSocket. Instead, the new overload Socket::connect(const Address&, const TLSClientConfig&) must be used. NOTE: This new overload is only available when libprocess is compiled with --enable-ssl.
Unresolved Critical Issues:
- MESOS-9889 - Master CPU high due to unexpected foreachkey behaviour in Master::__reregisterSlave
- MESOS-9697 - Release RPMs are not uploaded to bintray
- MESOS-9579 - ExecutorHttpApiTest.HeartbeatCalls is flaky.
- MESOS-9536 - Nested container launched with non-root user may not be able to write to its sandbox via the environment variable MESOS_SANDBOX
- MESOS-9520 - IOTest.Read hangs on Windows
- MESOS-9500 - spark submit with docker image on mesos cluster fails.
- MESOS-9426 - ZK master detection can become forever pending.
- MESOS-9393 - Fetcher crashes extracting archives with non-ASCII filenames.
- MESOS-9365 - Windows - GET_CONTAINERS API call causes the Mesos agent to fail
- MESOS-9355 - Persistence volume does not unmount correctly with wrong artifact URI
- MESOS-9352 - Data in persistent volume deleted accidentally when using Docker container and Persistent volume
- MESOS-9053 - Network ports isolator can falsely trigger while destroying containers.
- MESOS-9006 - The agent's GET_AGENT leaks resource information when using authorization
- MESOS-8877 - Docker container's resources will be wrongly enlarged in cgroups after agent recovery
- MESOS-8840 - cpu.cfs_quota_us may be accidentally set for command task using docker during agent recovery.
- MESOS-8803 - Libprocess deadlocks in a test.
- MESOS-8679 - If the first KILL stuck in the default executor, all other KILLs will be ignored.
- MESOS-8608 - RmdirContinueOnErrorTest.RemoveWithContinueOnError fails.
- MESOS-8257 - Unified Containerizer "leaks" a target container mount path to the host FS when the target resolves to an absolute path
- MESOS-8256 - Libprocess can silently deadlock due to worker thread exhaustion.
- MESOS-8096 - Enqueueing events in MockHTTPScheduler can lead to segfaults.
- MESOS-8038 - Launching GPU task sporadically fails.
- MESOS-7971 - PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove test is flaky
- MESOS-7911 - Non-checkpointing framework's tasks should not be marked LOST when agent disconnects.
- MESOS-7748 - Slow subscribers of streaming APIs can lead to Mesos OOMing.
- MESOS-7721 - Master's agent removal rate limit also applies to agent unreachability.
- MESOS-7566 - Master crash due to failed check in DRFSorter::remove
- MESOS-7386 - Executor not cleaning up existing running docker containers if external logrotate/logger processes die/killed
- MESOS-6285 - Agents may OOM during recovery if there are too many tasks or executors
- MESOS-5989 - Libevent SSL Socket downgrade code accesses uninitialized memory / assumes single peek is sufficient.
All Resolved Issues:
** Bug
- [MESOS-2842] - Master crashes when framework changes principal on re-registration
- [MESOS-5804] - ExamplesTest.DynamicReservationFramework is flaky
- [MESOS-6382] - Add option to enable parallel test runner for cmake builds
- [MESOS-6605] - configure looks for wrong header file for elfio
- [MESOS-8968] - Wire UPDATE_QUOTA call.
- [MESOS-9353] - libprocess triggers deprecation warnings when built against openssl 1.1.
- [MESOS-9395] - Check failure on StorageLocalResourceProviderProcess::applyCreateDisk.
- [MESOS-9482] - Resource provider manager can crash on invalid data from resource providers
- [MESOS-9560] - ContentType/AgentAPITest.MarkResourceProviderGone/1 is flaky
- [MESOS-9594] - Test StorageLocalResourceProviderTest.RetryRpcWithExponentialBackoff is flaky.
- [MESOS-9609] - Master check failure when marking agent unreachable
- [MESOS-9616] - Filters.refuse_seconds declines resources not in offers.
- [MESOS-9667] - Check failure when executor for task using resource provider resources subscribes before agent is registered
- [MESOS-9698] - DroppedOperationStatusUpdate test is flaky
- [MESOS-9707] - Calling link::lo() may cause runtime error
- [MESOS-9711] - Avoid shutting down executors registering before a required resource provider.
- [MESOS-9712] - StorageLocalResourceProviderTest.CsiPluginRpcMetrics is flaky.
- [MESOS-9719] - Test AgentFailoverHTTPExecutorUsingResourceProviderResources is flaky.
- [MESOS-9727] - Heartbeat calls from executor to agent are reported as errors
- [MESOS-9733] - Random sorter generates non-uniform result for hierarchical roles.
- [MESOS-9750] - Agent V1 GET_STATE response may report a complete executor's tasks as non-terminal after a graceful agent shutdown
- [MESOS-9765] - Test ROOT_CreateDestroyPersistentMountVolumeWithReboot is flaky.
- [MESOS-9766] - /processes endpoint can hang.
- [MESOS-9779] - UPDATE_RESOURCE_PROVIDER_CONFIG agent call returns 404 ambiguously.
- [MESOS-9782] - Random sorter fails to clear removed clients.
- [MESOS-9785] - Frameworks recovered from reregistered agents are not reported to master /api/v1 subscribers.
- [MESOS-9786] - Race between two REMOVE_QUOTA calls crashes the master.
- [MESOS-9803] - Memory leak caused by an infinite chain of futures in UriDiskProfileAdaptor.
- [MESOS-9808] - libprocess can deadlock on termination (cleanup() vs use() + terminate())
- [MESOS-9811] - Don't use reverse DNS for hostname validation
- [MESOS-9831] - Master should not report disconnected resource providers.
- [MESOS-9835] - QuotaRoleAllocateNonQuotaResource is failing.
- [MESOS-9836] - Docker containerizer overwrites /mesos/slave cgroups.
- [MESOS-9852] - Slow memory growth in master due to deferred deletion of offer filters and timers.
- [MESOS-9854] - /roles endpoint should return both guarantees and limits.
- [MESOS-9856] - REVIVE call with specified role(s) clears filters for all roles of a framework.
- [MESOS-9861] - Make PushGauges support floating point stats.
- [MESOS-9870] - Simultaneous adding/removal of a role from framework's roles and its suppressed roles crashes the master.
- [MESOS-9875] - Mesos did not respond correctly when operations should fail
- [MESOS-9881] - StorageLocalResourceProviderTest.RetryOperationStatusUpdateAfterRecovery is flaky.
- [MESOS-9882] - Mesos.UpdateFrameworkV0Test.SuppressedRoles is flaky.
- [MESOS-9886] - RoleTest.RolesEndpointContainsConsumedQuota is flaky.
- [MESOS-9887] - Race condition between two terminal task status updates for Docker/Command executor.
- [MESOS-9888] - /roles and GET_ROLES do not expose roles with only static reservations
- [MESOS-9890] - /roles and GET_ROLES does not always expose parent roles.
- [MESOS-9893] - `volume/secret` isolator should cleanup the stored secret from runtime directory when the container is destroyed
- [MESOS-9894] - Mesos failed to build due to fatal error C1083 on Windows using MSVC.
- [MESOS-9895] - SlaveTest.DrainingAgentRejectLaunch is flaky
- [MESOS-9901] - jsonify uses non-standard mapping for protobuf map fields.
- [MESOS-9902] - Mesos failed to build due to error C2280 on windows with MSVC
- [MESOS-9906] - Libprocess tests hangs on arm
- [MESOS-9909] - Mesos agent crashes after recovery when there is nested container joins a CNI network
- [MESOS-9922] - MasterQuotaTest.RescindOffersEnforcingLimits is flaky
- [MESOS-9925] - Default executor takes a couple of seconds to start and subscribe Mesos agent
- [MESOS-9930] - DRF sorter may omit clients in sorting after removing an inactive leaf node.
- [MESOS-9934] - Master does not handle returning unreachable agents as draining/deactivated
- [MESOS-9935] - The agent crashes after the disk du isolator supporting rootfs checks.
- [MESOS-9952] - ExampleTest.DiskFullFramework is slow
- [MESOS-9956] - CSI plugins reporting duplicated volumes will crash the agent.
** Epic
- [MESOS-9534] - CSI Spec v1.0 Support.
- [MESOS-9756] - Introduce a container debug endpoint.
- [MESOS-9784] - Client side SSL certificate verification in Libprocess.
- [MESOS-9795] - Support configurable /dev/shm and IPC namespace.
** Improvement
- [MESOS-7258] - Provide scheduler calls to subscribe to additional roles and unsubscribe from roles.
- [MESOS-8456] - Allocator should allow roles to burst above guarantees but below limits.
- [MESOS-8789] - /roles and webui roles table should display distinct offered and allocated resources.
- [MESOS-9254] - Make SLRP be able to update its volumes and storage pools.
- [MESOS-9545] - Marking an unreachable agent as gone should transition the tasks to terminal state
- [MESOS-9618] - Display quota consumption in the webui.
- [MESOS-9640] - Add authorization support for `UPDATE_QUOTA` call.
- [MESOS-9668] - Add authorization support for the new `GET_QUOTA` call.
- [MESOS-9669] - Deprecate v0 quota calls.
- [MESOS-9695] - Remove the duplicate pid check in Docker containerizer
- [MESOS-9701] - Allocator's roles map should track reservations.
- [MESOS-9724] - Flatten the weighted shuffling in the random sorter.
- [MESOS-9758] - Take ports out of the GET_ROLES endpoints.
- [MESOS-9759] - Log required quota headroom and available quota headroom in the allocator.
- [MESOS-9760] - Decouple Docker runtime isolator manifest configuration from image provider
- [MESOS-9769] - Add direct containerized support for filesystem operations.
- [MESOS-9770] - Add no-new-privileges isolator.
- [MESOS-9771] - Mask sensitive procfs paths.
- [MESOS-9778] - Randomized the agents in the second allocation stage.
- [MESOS-9787] - Log slow SSL (TLS) peer reverse DNS lookup.
- [MESOS-9791] - Libprocess does not support server only SSL certificate verification.
- [MESOS-9799] - Adopt container file operations in secrets volumes.
- [MESOS-9802] - Remove quota role sorter in the allocator.
- [MESOS-9805] - Run cgroup subsystems before moving the target PID.
- [MESOS-9806] - Address allocator performance regression due to the addition of quota limits.
- [MESOS-9807] - Introduce a `struct Quota` wrapper.
- [MESOS-9812] - Add achievability validation for update quota call.
- [MESOS-9820] - Add `updateQuota()` method to the allocator.
- [MESOS-9833] - Introduce an agent flag for the default `/dev/shm` size
- [MESOS-9876] - Use geteuid to determine subprocess' user when launching task.
- [MESOS-9878] - Enable libprocess users to pass a custom SSL context when using Socket
- [MESOS-9900] - Include overlayfs upperdir in disk quota accounting.
- [MESOS-9908] - Introduce a new agent flag and support docker volume chown to task user.
- [MESOS-9917] - Store a role tree in the allocator.
- [MESOS-9932] - Removal of a role from the suppression list should be equivalent to REVIVE.
** Task
- [MESOS-8486] - Webui should display role limits.
- [MESOS-9485] - Unit test for master operation authorization.
- [MESOS-9565] - Unit tests for creating and destroying persistent volumes in SLRP.
- [MESOS-9598] - Update GET `/quota` to return both guarantees and limits.
- [MESOS-9599] - Update `GET_QUOTA` to return both guarantees and limits.
- [MESOS-9600] - Deprecate `SET_QUOTA` and `REMOVE_QUOTA` calls in favor of `UPDATE_QUOTA`.
- [MESOS-9601] - Persist `QuotaConfig`s in the registry.
- [MESOS-9602] - Provide backward compatibility for old quota configurations.
- [MESOS-9603] - Add quota limits metrics.
- [MESOS-9627] - Test CSI v1 in SLRP unit tests.
- [MESOS-9699] - Pull in glog 0.4.0
- [MESOS-9710] - Add tests to ensure random sorter performs correct weighted sorting.
- [MESOS-9715] - Support specifying output file name for curl fetcher plugin
- [MESOS-9754] - Design doc for agent draining
- [MESOS-9757] - Design doc for container debug endpoint.
- [MESOS-9775] - Design doc for UCR shared memory.
- [MESOS-9788] - Configurable IPC namespace and shared memory in `namespaces/ipc` isolator
- [MESOS-9793] - Implement UPDATE_FRAMEWORK call in V0 API for C++/Java
- [MESOS-9809] - Use OpenSSL built-in functions for hostname validation
- [MESOS-9810] - Reject certificate-less ciphers when certificate verification is enabled
- [MESOS-9814] - Implement DrainAgent master/operator call with associated registry actions
- [MESOS-9816] - Add draining state information to master state endpoints
- [MESOS-9817] - Add minimum master capability for draining and deactivation states
- [MESOS-9818] - Implement minimal agent-side draining handler
- [MESOS-9821] - Agent kills all tasks when draining
- [MESOS-9822] - Agent recovery code for task draining
- [MESOS-9823] - Agent should modify status updates while draining
- [MESOS-9825] - Introduce an agent flag to disallow sharing the IPC namespace from the host.
- [MESOS-9826] - Set up `/dev/shm` in `filesystem/linux` isolator only when `namespaces/ipc` isolator is not enabled
- [MESOS-9827] - Introduce the configurable shm protobuf API.
- [MESOS-9828] - Document the IPC namespace and shm on UCR.
- [MESOS-9829] - Implement the container debug endpoint on slave/http.cpp
- [MESOS-9837] - Implement `FutureTracker` class along with helper functions.
- [MESOS-9839] - Implement `IsolatorTracker` class.
- [MESOS-9840] - Implement `LauncherTracker` class.
- [MESOS-9841] - Integrate `IsolatorTracker` and `LinuxLauncher` with Mesos containerizer.
- [MESOS-9842] - Implement tests for the `FutureTracker` class and for its helper functions.
- [MESOS-9845] - Add docs for automatic agent draining
- [MESOS-9846] - Update UI for agent draining
- [MESOS-9849] - Add support for per-role REVIVE / SUPPRESS to V0 scheduler driver.
- [MESOS-9853] - Update Docker executor to allow kill policy overrides
- [MESOS-9860] - Agent should erase DrainInfo when draining complete
- [MESOS-9862] - Agent should fail task launches while draining
- [MESOS-9871] - Expose quota consumption in /roles endpoint.
- [MESOS-9874] - Add environment variable `MESOS_ALLOCATION_ROLE` to the task/container.
- [MESOS-9892] - Test various agent state transitions involving agent draining
- [MESOS-9907] - Retain agent draining start time in master
** Documentation
- [MESOS-9427] - Revisit quota documentation.
-
v1.9.0-rc3
September 02, 2019
-
v1.9.0-rc2
August 28, 2019
-
v1.9.0-rc1
August 27, 2019
-
v1.8.2 Changes
- This is a bug fix release.
** Bug
- [MESOS-9609] - Master check failure when marking agent unreachable.
- [MESOS-9785] - Frameworks recovered from reregistered agents are not reported to master `/api/v1` subscribers.
- [MESOS-9836] - Docker containerizer overwrites `/mesos/slave` cgroups.
- [MESOS-9868] - NetworkInfo from the agent /state endpoint is not correct.
- [MESOS-9887] - Race condition between two terminal task status updates for Docker/Command executor.
- [MESOS-9893] - `volume/secret` isolator should cleanup the stored secret from runtime directory when the container is destroyed.
- [MESOS-9925] - Default executor takes a couple of seconds to start and subscribe Mesos agent.
- [MESOS-9964] - Support destroying UCR containers in provisioning state.
- [MESOS-9966] - Agent crashes when trying to destroy orphaned nested container if root container is orphaned as well.
- [MESOS-9968] - WWWAuthenticate header parsing fails when commas are in (quoted) realm
- [MESOS-10007] - Command executor can miss exit status for short-lived commands due to double-reaping.
- [MESOS-10015] - updateAllocation() can stall the allocator with a huge number of reservations on an agent.
- [MESOS-10126] - Docker volume isolator needs to clean up the `info` struct regardless the result of unmount operation
- [MESOS-10134] - Race between concurrent `javah` runs trying to create `java/jni` output directory.
- [MESOS-10169] - Reintroduce image fetch deduplication while keeping it possible to destroy UCR containers in PROVISIONING state.
** Improvement
- [MESOS-9889] - Master CPU high due to unexpected foreachkey behaviour in Master::__reregisterSlave.
- [MESOS-9948] - master::Slave::hasExecutor occupies 37% of a 150 second perf sample.
- [MESOS-10017] - Log all reverse DNS lookup failures in 'legacy' TLS (SSL) hostname validation scheme.