Optimize a Program with Schedules¶
Oftentimes, only compiling your programs to native code is not enough, and you need further optimizations. This can be done by applying "schedules" (explicit program transformations) to you program.
Example: Parallel Vector addition¶
import freetensor as ft
import numpy as np
n = 4
# Add verbose=1 to see the resulting native code
@ft.optimize(schedule_callback=lambda s: s.parallelize('Li', 'openmp')
) # <-- 2. Apply the schedule
def test(a: ft.Var[(n,), "int32"], b: ft.Var[(n,), "int32"]):
y = ft.empty((n,), "int32")
#! label: Li # <-- 1. Label the loop as Li
for i in range(n):
y[i] = a[i] + b[i]
return y
y = test(np.array([1, 2, 3, 4], dtype="int32"),
np.array([2, 3, 4, 5], dtype="int32")).numpy()
print(y)
Here is an example of a parallel vector addition executed with OpenMP multithreading. Each element is computed by one thread. To achieve this, there are two steps:
- Label the loop to be parallelized with a
#! label:comment. Herelabelrefers to label of an AST node, which is not required to be unique. - Apply a
parallelizeschedule toLiin theschedule_callbackargument tooptimize; since theLilabel is unambiguous here, the onlyLiloop is selectd and parallelized.
And you are done. You can have a look at the generated OpenMP multithreaded code by setting verbose=1.
Parameter s in schedule_callback is a Schedule object. Besides parallelize, there are more supported scheduling primitives.
If you are using the @optimize_to_pytorch integration, you need to set schedules for the forward pass and the backward pass separately.
Combining Multiple Schdules¶
Some optimizations can be done by applying multiple schedules. For example, a tiled matrix-multiplication can be done by first split the loops, then reorder them, and finally apply caches to create tile tensors.
In order to demonstrate the idea, we show a simplier example here: still a vector addtion, but with the loop split and only the outer one parallelized. Please note that this is an example only for demonstration. Usually you do not need it because OpenMP has its own "schedule(static)" for parallelized loops.
import freetensor as ft
import numpy as np
n = 1024
def sch(s):
outer, inner = s.split('Li', 32)
s.parallelize(outer, 'openmp')
# Set verbose=1 to see the resulting native code
# Set verbose=2 to see the code after EVERY schedule
@ft.optimize(schedule_callback=sch)
def test(a: ft.Var[(n,), "int32"], b: ft.Var[(n,), "int32"]):
y = ft.empty((n,), "int32")
#! label: Li
for i in range(n):
y[i] = a[i] + b[i]
return y
y = test(np.array(np.arange(1024), dtype="int32"),
np.array(np.arange(1024), dtype="int32")).numpy()
print(y)
One important thing is to track labels of the loops, because the labels will change after schedules. You get labels (to be precise, IDs, which is can be looked-up by labels) of new loops generated from one schedule from its return values (outer and inner in this case), and pass them to a next schedule.
Specify What to Schedule by Selectors¶
In the example above, we label a loop Li and apply schedules on it. It is straight-forward in a tiny example, but as programs grow, it often gets hard to track each statement by a unique label, especially there are inlined function calls. To make things easy, FreeTensor supports specifying a statement by a selector, written in the following rules:
- A label is a selector. E.g.,
Limatches a statement with a labelLi. - (For debugging only) A numerical ID is also a selector. E.g.,
#31. - A node type surrounded in angle brackets (
<>) is also a selector. E.g.,<For>matches for-loop statements. - A selector can be extended to match a new statement produced by a previous schedule. E.g.,
$split.0{Li}matches the outer loop split from the loopLi. This is useful when return values from schedules are hard to track. Please refer the API document for detailed grammar. - Selectors can be combined to match a statement by nesting order.
A<-Bmatches a statementADIRECTLY NESTED IN another statementB.A<<-Bmatches a statementADIRECTLY or INDIRECTLY nested in another statementB.A<-(B<-)*Cmatches a statementADIRECTLY or INDIRECTLY nested in another statementCwith intermedaite nesting statements satisfying the condition inB.B->Amatches a statementBdirectly OUT OF another statementA.B->>AandC->(B->)*Aare alike. (A,B,Ccan be nested selectors.) Use<-|for the root node, and->|for a leaf node. - Selectors can be combined to match a statement by program order.
A<:Bmatches a statementADIRECTLY BEFORE another statementB.A<<:Bmatches a statementADIRECTLY or INDIRECTLY before another statementB.A<:(B<:)*Cmatches a statementADIRECTLY or INDIRECTLY before another statementCwith intermediate statements satisfying the condition inB.B:>Amatches a statmentBdirectly AFTER another statementA.B:>>AandC:>(B:>)*Aare alike. A statement's descents (other statements nested in) or ancestors (other statements nesting it) are neither considered before or after the statement in the program order. "Directly" means there is no other statement between the comparing statements, so a statment can "directly before" both another statement and its parent. - Selectors can be combined to match a statement in a function call.
A<~Bmatches a statementADIRECTLY called by a call siteB.A<<~Bmatches a statementADIRECTLY or INDIRECTLY called by a call siteB.A<~(B<~)*Cmatches a statementADIRECTLY or INDIRECTLY called by a call siteCwith intermediate call sites satisfying the condition inB. (A,B,Ccan be nested selectors.) Use<~|for the root function. - All the arrow-like selectors (
<-,<~,<:, etc.) are right-associated. For example,A<-B<-CmatchesAnested inB, whereBis nested inC. - All the arrow-like selectors can be used with the first argument omitted. For example,
<-Bmatches ALL statements nested inB. - Selectors can be combined with logical "and" (
&), "or" (|), "not" (!) and parentheses. E.g.,Li|Ljmatches a statement labeledLiORLj.Li&Ljmatches a statement labeledLi&Lj.
All schedules support passing selectors.
Auto Scheduling (Experimental)¶
Manually scheduling a program requires a lot of efforts. We provide an experimental automatic scheduling functions in Schedule. You can call s.auto_schedule to pick schedules fully automatically. s.auto_schedule calls other s.auto_xxxxxx functions internally, you can also call one or some of them instead. Please note that these auto-scheduling functions are experimental, and their API is subject to changes.