Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

z_cstart bug #32444

Closed
hongshui3000 opened this issue Feb 18, 2021 · 13 comments
Closed

z_cstart bug #32444

hongshui3000 opened this issue Feb 18, 2021 · 13 comments
Assignees
Labels
area: Kernel bug The issue is a bug, or the PR is fixing a bug priority: medium Medium impact/importance bug

Comments

@hongshui3000
Copy link
Contributor

hongshui3000 commented Feb 18, 2021

Describe the bug
file:init.c

FUNC_NORETURN void z_cstart(void)
{
	/* gcov hook needed to get the coverage report.*/
	gcov_static_init();

	LOG_CORE_INIT();

	/* perform any architecture-specific initialization */
	arch_kernel_init();

#if defined(CONFIG_MULTITHREADING)
	/* Note: The z_ready_thread() call in prepare_multithreading() requires
	 * a dummy thread even if CONFIG_ARCH_HAS_CUSTOM_SWAP_TO_MAIN=y
	 */
	struct k_thread dummy_thread;

	z_dummy_thread_init(&dummy_thread);
#endif

	/* perform basic hardware initialization */
	z_sys_init_run_level(_SYS_INIT_LEVEL_PRE_KERNEL_1);-------------------------------------->  where the bug originated
	z_sys_init_run_level(_SYS_INIT_LEVEL_PRE_KERNEL_2);--------------------------------------> where the bug originated

#ifdef CONFIG_STACK_CANARIES
	uintptr_t stack_guard;

	z_early_boot_rand_get((uint8_t *)&stack_guard, sizeof(stack_guard));
	__stack_chk_guard = stack_guard;
	__stack_chk_guard <<= 8;
#endif	/* CONFIG_STACK_CANARIES */

#ifdef CONFIG_MULTITHREADING
	switch_to_main_thread(prepare_multithreading());  --------------------------------------------> where the bug originated
#else
#ifdef ARCH_SWITCH_TO_MAIN_NO_MULTITHREADING
	/* Custom ARCH-specific routine to switch to main()
	 * in the case of no multi-threading.
	 */
	ARCH_SWITCH_TO_MAIN_NO_MULTITHREADING(bg_thread_main,
		NULL, NULL, NULL);
#else
	bg_thread_main(NULL, NULL, NULL);

	/* LCOV_EXCL_START
	 * We've already dumped coverage data at this point.
	 */
	irq_lock();
	while (true) {
	}
	/* LCOV_EXCL_STOP */
#endif
#endif /* CONFIG_MULTITHREADING */

	/*
	 * Compiler can't tell that the above routines won't return and issues
	 * a warning unless we explicitly tell it that control never gets this
	 * far.
	 */

	CODE_UNREACHABLE; /* LCOV_EXCL_LINE */
}

If the interrupt is enabled when the driver is initialized, and the interrupt is entered before the interrupt stack is initialized, the software will generate a bug.

If I understand it correctly, the initialization of the interrupt stack will always occur after the interrupt is enabled .

When the interrupt is enabled, an interrupt is generated immediately, and the interrupt stack is not ready yet, then the bug I said will appear.

interrupt en:

z_sys_init_run_level(_SYS_INIT_LEVEL_PRE_KERNEL_2)

interrupt stack:

_kernel.cpus[i].irq_stack =(Z_KERNEL_STACK_BUFFER(z_interrupt_stacks[i]) +K_KERNEL_STACK_SIZEOF(z_interrupt_stacks[i])) 

I found this problem when I debugged my systemtick

@hongshui3000 hongshui3000 added the bug The issue is a bug, or the PR is fixing a bug label Feb 18, 2021
@carlescufi carlescufi changed the title z_cstart Bug ?? z_cstart bug Feb 22, 2021
@yerabolu yerabolu assigned yerabolu and unassigned andyross Feb 22, 2021
@nashif nashif added the priority: medium Medium impact/importance bug label Feb 22, 2021
@StefJar
Copy link

StefJar commented Feb 28, 2021

any progress on that so far? Maybe my problem is related to that (#32261)

@yerabolu
Copy link
Contributor

yerabolu commented Mar 1, 2021

@hongshui3000 : Interrupt is disabled when it enters z_cstart and gets enabled only when it enters main thread through bg_thread_main.

Also, could you give some more information on how this is encountered and which platform its encountered on?

@yerabolu
Copy link
Contributor

yerabolu commented Mar 1, 2021

any progress on that so far? Maybe my problem is related to that (#32261)

Looking at the bug it doesn't look to be related. But will look into it further and get back in here.

@StefJar
Copy link

StefJar commented Mar 1, 2021

@yerabolu my issue is happens at the start up of zephyr when the main function is executed and a task switch happens. Its highly probable that it is triggered through an ISR.
Since a (custom) bootloader runs, before zephyr starts, it might happen that an interrupt etc. is activated there(maybe bug).

@hongshui3000
Copy link
Contributor Author

hongshui3000 commented Mar 2, 2021

@hongshui3000 : Interrupt is disabled when it enters z_cstart and gets enabled only when it enters main thread through bg_thread_main.

Also, could you give some more information on how this is encountered and which platform its encountered on?
file : cavs_timer.c

int z_clock_driver_init(const struct device *device)
{
	uint64_t curr = count();

	IRQ_CONNECT(TIMER_IRQ, 0, compare_isr, 0, 0);
	set_compare(curr + CYC_PER_TICK);
	last_count = curr;
	irq_enable(TIMER_IRQ); ---------------------------------------------------------->this
	return 0;
}

and file: uart_mcux.c

static int uart_mcux_init(const struct device *dev)
{
#ifdef CONFIG_UART_INTERRUPT_DRIVEN
	const struct uart_mcux_config *config = dev->config;
#endif
	struct uart_mcux_data *data = dev->data;
	int err;

	err = uart_mcux_configure(dev, &data->uart_cfg);
	if (err != 0) {
		return err;
	}

#ifdef CONFIG_UART_INTERRUPT_DRIVEN
	config->irq_config_func(dev);
#endif

	return 0;
}

#ifdef CONFIG_UART_INTERRUPT_DRIVEN
#define UART_MCUX_CONFIG_FUNC(n)					\
	static void uart_mcux_config_func_##n(const struct device *dev)	\
	{								\
		IRQ_CONNECT(DT_INST_IRQ_BY_NAME(n, status, irq),	\
			    DT_INST_IRQ_BY_NAME(n, status, priority),	\
			    uart_mcux_isr, DEVICE_GET(uart_##n), 0);	\
									\
		irq_enable(DT_INST_IRQ_BY_NAME(n, status, irq));	\---------------------------->this
									\
		IRQ_CONNECT(DT_INST_IRQ_BY_NAME(n, error, irq),		\
			    DT_INST_IRQ_BY_NAME(n, error, priority),	\
			    uart_mcux_isr, DEVICE_GET(uart_##n), 0);	\
									\
		irq_enable(DT_INST_IRQ_BY_NAME(n, error, irq));		\
	}
#define UART_MCUX_IRQ_CFG_FUNC_INIT(n)					\
	.irq_config_func = uart_mcux_config_func_##n
#define UART_MCUX_INIT_CFG(n)						\
	UART_MCUX_DECLARE_CFG(n, UART_MCUX_IRQ_CFG_FUNC_INIT(n))
#else
#define UART_MCUX_CONFIG_FUNC(n)
#define UART_MCUX_IRQ_CFG_FUNC_INIT
#define UART_MCUX_INIT_CFG(n)						\
	UART_MCUX_DECLARE_CFG(n, UART_MCUX_IRQ_CFG_FUNC_INIT)
#endif

If you use them, then the interrupt may be enabled before the interrupt stack is initialized. At least in the xtensa architecture I don’t see interrupts being turned off.If log is turned on, I feel that the interrupt will also be enabled in advance.

In the xtensa architecture, if I deliberately reduce the value of CYC_PER_TICK (such as 0 or 1), then the interrupt will be generated before entering bg_thread_main

I feel that this should involve all the drivers started by _SYS_INIT_LEVEL_PRE_KERNEL_2 and _SYS_INIT_LEVEL_PRE_KERNEL_1 and whether the architecture will be enabled to interrupt. Our system does not make it clear that interrupts cannot be enabled in this place. So I feel that there may be potential problems at present or in the future.

@yerabolu
Copy link
Contributor

yerabolu commented Mar 3, 2021

Thank You for the info. Investigating further to see which drivers are started by _SYS_INIT_LEVEL_PRE_KERNEL_2 and _SYS_INIT_LEVEL_PRE_KERNEL_1. Will update on it by EOD.

@yerabolu
Copy link
Contributor

yerabolu commented Mar 16, 2021

@hongshui3000: Ran the below test to see if the irq_lock() and irq_unlock() are setting correctly and ran the tests. All the tests passed. Could you try it on your end as well and see if there is any issue.

	/* Custom ARCH-specific routine to switch to main()
	 * in the case of no multi-threading.
	 */
	ARCH_SWITCH_TO_MAIN_NO_MULTITHREADING(bg_thread_main,
		NULL, NULL, NULL);
#else
	unsigned int key = irq_lock();
	__ASSERT(!arch_irq_unlocked(key),"");
	irq_unlock(key);
	bg_thread_main(NULL, NULL, NULL);

@hongshui3000
Copy link
Contributor Author

@yerabolu
Thank you for the information, I will try to debug my program later. But what I want to know is when zephyr rtos is started, "irq_lock" and "irq_unlock" are executed in the above place. Do you suspect that the execution of "irq_lock" in my code failed?

@yerabolu
Copy link
Contributor

yerabolu commented Mar 17, 2021

@yerabolu
Thank you for the information, I will try to debug my program later. But what I want to know is when zephyr rtos is started, "irq_lock" and "irq_unlock" are executed in the above place. Do you suspect that the execution of "irq_lock" in my code failed?

@hongshui3000
Looking at the code mentioned above, Interrupt is disabled when the zephyr rtos is started and then gets enabled when it enters main thread. And the above test confirms that.

When you mention "Do you suspect that the execution of "irq_lock" in my code failed?" - Are there are any new additions/changes to the above code that you are working on ?

@hongshui3000
Copy link
Contributor Author

hongshui3000 commented Mar 18, 2021

@yerabolu
I understand what you said. But when I was debugging the xtensa architecture, I did see an interruption. I will continue to check my own code.

When the "_SYS_INIT_LEVEL_PRE_KERNEL_2" driver is initialized, some drivers enable interrupts. These drivers are initialized before entering main thread.

Whether an interrupt occurs depends on the processor interrupt architecture and the implementation of irq_lock.

It's just that what I see in the interrupt architecture of xtensa is like this, not the interrupt architecture of the finished chip (the interrupt controller that increases the hardware may control the global interrupt)

I put this question here, the main purpose is to think that in a system without global interrupt control, the timing of the interrupt enable may cause problems.

@yerabolu
Copy link
Contributor

yerabolu commented Mar 18, 2021

@hongshui3000 : Unfortunately i am not able to see this issue on my end and without a way to reproducing it in some way i will not be able to figure out if it is actually causing problems with the timing of enabling interrupts. But on the Zephyr master i don't see any issue.

Adding @andyross to take a look on this.

@andyross
Copy link
Contributor

I understand what you said. But when I was debugging the xtensa architecture, I did see an interruption. I will continue to check my own code.

Honestly this sounds like a platform bug. The way this design is supposed to work:

  • The platform layer enters z_cstart() with interrupts masked[1]
  • The OS initializes core elements like the _kernel struct, main thread, interrupt stacks, etc...
  • The OS does its device- and driver-level initialization
  • The OS then context switches to the main thread (via a z_swap() to init.c:bg_thread_main()), which is the first moment where interrupts become unmasked

(And on the SMP CPUs the process is similar, they start with interrupts masked and only enable them once switching into user/idle threads)

[1] In the sense of arch_irq_lock(). Note that arch_irq_enable(int) is a separate API designed to selectively mask specific interrupts for the purpose of device control. An interrupt that is "enabled" will still not be delivered if it is globally masked by irq_lock(). On Xtensa, "enabled" corresponds to a bit set in the INTENABLE special register, while irq_lock() works by setting the INTLEVEL field in the bottom bits of PS to XCHAL_EXCM_LEVEL.

@hongshui3000
Copy link
Contributor Author

hongshui3000 commented Mar 19, 2021

@andyross @yerabolu
I have basically obtained the information I want, thanks for the reply. I think the issue can be closed now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: Kernel bug The issue is a bug, or the PR is fixing a bug priority: medium Medium impact/importance bug
Projects
None yet
Development

No branches or pull requests

6 participants