Jyotish: A Novel Framework for Constructing Predictive Model of People Movement from Joint Wifi/Bluetooth Trace. Long Vu, Quang Do, Klara Nahrstedt. Ninth Annual IEEE International Conference on Pervasive Computing and Communications (PerCom 2011). (Acceptance rate 11%). Mark Weiser Best Paper Award.
Jyotish: A Novel Framework for Constructing Predictive Model of People Movement from Joint Wifi/Bluetooth Trace
Long Vu, Quang Do, Klara Nahrstedt Department of Computer Science, University of Illinois Email:{longvu2,quangdo2,klara}@illinois.edu
Abstract: It is well known that people movement exhibits a high degree of repetition since people visit regular places and make regular contacts for their daily activities. This paper¹ presents a novel framework named Jyotish², which constructs a predictive model by exploiting the regularity of people movement found in a real joint Wifi/Bluetooth trace. The constructed model is able to answer three fundamental questions: (1) where the person will stay, (2) how long she will stay at the location, and (3) who she will meet. In order to construct the predictive model, Jyotish includes an efficient clustering algorithm to cluster Wifi access point information in the Wifi trace into locations. Then, we construct a Naive Bayesian classifier to assign these locations to records in the Bluetooth trace and obtain a fine-grained trace of people movement. Next, the fine-grained movement trace is used to construct the predictive model, which includes a location predictor, a stay duration predictor, and a contact predictor that answer the three questions above. Finally, we evaluate the constructed predictive model over the real Wifi/Bluetooth trace collected by 50 participants on the University of Illinois campus from March to August 2010. Evaluation results show that Jyotish successfully constructs a predictive model that provides considerably high prediction accuracy for people movement.
I. INTRODUCTION
The ability to accurately predict people movement is crucial to numerous domains such as wireless networks, HCI, social science, urban planning, and transportation. While predicting the movement of a person, we seek the answers to three fundamental questions: (1) Where will the person stay at a future time (i.e., location)? (2) How long will she stay at the location (i.e., stay duration)? (3) Who will she meet (i.e., contact)? Answering all three questions together remains challenging due to (1) the complex nature of people movement and (2) the lack of a realistic people movement trace from which to construct an accurate predictive model. On the other hand, it is believed that people movement exhibits a high degree of repetition, in which people visit regular places for their daily activities [1]. As a result, previous projects have exploited the regularity in past movement to predict the future movement of people.
¹ This work is supported by Boeing Trusted Software Center at the Information Trust Institute, University of Illinois.
² In Sanskrit, Jyotish (Ji-o-tish) is the person who predicts future events.
The first class of prediction methods focused on predicting location of people movement [2], [3], [4], [5], which essentially only answered the first question above. In particular, a large number of previous papers used the association trace between the laptop/PDA and the Wifi access points (i.e., WLAN trace) to derive and evaluate their location predictors [4], [3]. However, there was a fundamental weakness of using WLAN trace in constructing location predictor since the laptop user did not always turn on the laptop and did not always carry it with her. So, the collected associations of laptops and the Wifi access points could be used to understand the wireless usage rather than to predict the location of people. Other previous projects used cellular data trace to construct the location predictor [6], [7], in which the location was inferred from the cellular base station. However, since the transmission range of the cellular base station was ranging from several hundred meters (e.g., 500 m) to kilometers (e.g., 30 km), the location predictor derived by this inferred location might not provide needed fine granularity and accuracy. The second class of prediction methods answered the first two questions by providing predictions for the stay duration [8] or location and stay duration [9]. McNamara et al. predicted the stay duration of commuters of the subways to select the best source of media content [8]. Lee and Hou modeled user mobility by a semi-Markov process and devised a timed location prediction algorithm that predicted the future access point (i.e., the location in the paper’s context) of the user and the association duration [9]. Since the model was constructed and evaluated by the WLAN trace, it suffered from the same fundamental weaknesses as discussed in the previous paragraph. Recently, there have been several projects collecting ad hoc contact traces using portable experiment devices such as iMote, cellphone, PDA [10], [11], [12], [13]. 
These traces can be used to answer the third question about future contact. However, these traces did not have location information and thus could not be used to answer the first two questions.

In our recent work, we deployed the UIM scanning system (UIM stands for University of Illinois Movement) on Google Android phones to collect MAC addresses of Wifi access points and Bluetooth devices in the proximity of the experiment participants [14]. We observe that Wifi access point information can be used to infer location [15] while Bluetooth MACs can be used to infer contact [10], [11]. The joint Wifi/Bluetooth trace thus can be used to study people movement. This paper presents the Jyotish framework, which exploits the regularity of people movement found in the joint Wifi/Bluetooth trace collected by the UIM system to construct a predictive model of people movement. In summary, our paper has the following contributions:
1) To the best of our knowledge, Jyotish is the first framework that constructs a predictive model of people movement from a joint Wifi/Bluetooth trace.
2) Also to the best of our knowledge, the constructed predictive model is the first to predict future location, stay duration at the location, and contact altogether.
3) We present an efficient clustering algorithm to cluster Wifi access point information into locations by exploiting the regularity of people movement. Our algorithm overcomes the Wifi signal fluctuation that affected previous work [16], [17] and provides a finer grain of location than that derived from cellular base stations [7].
4) We evaluate the constructed predictive model over the real Wifi/Bluetooth trace collected by 50 experiment participants on the University of Illinois campus from March to August 2010.

This paper is organized as follows. We present the trace collected by the UIM system and an overview of the Jyotish framework in Section II. Then, we present a clustering algorithm to cluster Wifi access point information into locations in Section III. These locations are assigned to records in the Bluetooth trace in Section IV. Then, the Bluetooth trace with assigned locations is used to construct the predictive model in Section V. Finally, we evaluate the predictive model in Section VI and conclude the paper in Section VII.

II. OVERVIEW OF UIM TRACE AND JYOTISH
A. UIM Collected Trace
We deployed the UIM scanning system [14] on Google phones carried by 123 participants from March to August 2010 on the University of Illinois campus, in three rounds: from the beginning of March to the end of March, from the beginning of April to mid-May, and from the end of May to mid-August. Many participants took part in the experiment for one to two months. The participants included faculty members and students. More detail on the UIM system and its collected data set can be found in our previous paper [14].

The UIM system has a Wifi scanner and a Bluetooth scanner. The former periodically (i.e., every 30 minutes) captures MAC addresses of Wifi access points while the latter periodically (i.e., every 60 seconds) captures MAC addresses of Bluetooth-enabled devices in the proximity of experiment phones. These scanning frequencies are set to conserve phone battery since (1) most participants use experiment phones as their daily phones, and (2) the Wifi scanner consumes much more power than the Bluetooth scanner. With these scanning frequencies, the phone battery lasts for 2 days (including other usage by participants), which makes it acceptable for participants to carry the phones for the prolonged experiment. The traces collected by the scanners are called “Wifi trace” and “Bluetooth trace” respectively. Henceforth, we use the terms Bluetooth and BT interchangeably.
Table I. Example of Wifi trace W
Scan Time      | Wifi MACs
03/08/10 09:20 | a1, a3
03/08/10 09:50 | a1, a5
03/08/10 10:20 | a6
03/08/10 13:50 | a4, a7, a9
03/14/10 08:20 | a1, a3
In order to clarify the presentation of this paper, we use Relational Algebra [18] to represent and manipulate the collected data set. So, we use the terms “set”, “table”, and “relation” interchangeably, and “record” and “tuple” interchangeably. Also, we use “person” and “phone” interchangeably, and “stay duration” and “duration” interchangeably. For an experiment phone $p$, let $D$ be the entire collected data set, so $D = W \cup B$, in which $W$ is the relation representing the collected Wifi trace and $B$ is the relation representing the collected Bluetooth trace. Tables I and II show examples of $W$ and $B$. $W$ has multiple Wifi tuples: $W = \{w_1, w_2, w_3, \ldots, w_{|W|}\}$. Each tuple $w_i \in W$ has the format $w_i = \langle t_i, A_i \rangle$, where $A_i$ is the set of Wifi MACs returned from one Wifi scan and $t_i$ is the scan time of that Wifi scan. So, we have $A_i = \{a_1, a_2, \ldots, a_j, \ldots\}$, in which $a_j$ is the $j$th Wifi MAC scanned by the Wifi scanner of $p$ during the entire experiment period. In Table I, each row is one tuple $w_i$. Let $W_A$ be the set of all Wifi MACs scanned by the Wifi scanner for the entire experiment period of one experiment phone. For Table I, $W_A = \{a_1, a_3, a_4, a_5, a_6, a_7, a_9\}$.
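As a small illustration of this notation, the Wifi relation of Table I can be written down as a list of (scan time, MAC set) tuples, from which the set of all scanned Wifi MACs follows as a union. This is only a sketch; the variable names `W` and `W_A` mirror the symbols in the text and the in-memory representation is an assumption, not the UIM storage format.

```python
# Relation W from Table I: each tuple is (scan time t_i, set of Wifi MACs A_i).
W = [
    ("03/08/10 09:20", {"a1", "a3"}),
    ("03/08/10 09:50", {"a1", "a5"}),
    ("03/08/10 10:20", {"a6"}),
    ("03/08/10 13:50", {"a4", "a7", "a9"}),
    ("03/14/10 08:20", {"a1", "a3"}),
]

# W_A: every Wifi MAC seen by this phone over the whole trace.
W_A = set().union(*(macs for _, macs in W))
print(sorted(W_A))
```

The Bluetooth relation can be represented the same way, with sets of BT MACs in place of Wifi MACs.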
Table II. Example of BT trace B
Scan Time      | BT MACs
03/08/10 09:20 | u1, u3
03/08/10 09:21 | u1, u3
03/08/10 09:22 | u1
03/08/10 13:50 | u4, u9
03/14/10 08:14 | u1, u3, u8
Similarly, the relation $B$ has multiple BT tuples: $B = \{b_1, b_2, b_3, \ldots, b_{|B|}\}$. Each tuple $b_i \in B$ has the format $b_i = \langle t_i, U_i \rangle$, where $U_i$ is the set of BT MACs returned from one BT scan and $t_i$ is the scan time of that BT scan. So, we have $U_i = \{u_1, u_2, \ldots, u_j, \ldots\}$, in which $u_j$ is the $j$th BT MAC scanned by the BT scanner of $p$ during the entire experiment period. Let $B_A$ be the set of all BT MACs scanned by the BT scanner for the entire experiment period of one experiment phone. For Table II, $B_A = \{u_1, u_3, u_4, u_8, u_9\}$. Notice that since the Wifi scanner and the BT scanner run concurrently, the scan times of tuples of $W$ and $B$ overlap.

1) Contact Definition: We say the experiment phone $p$ has a contact with a device $p_j$ whose BT MAC is $u_j$ if $u_j$ appears in one tuple of $p$'s BT trace $B$. This contact definition can also be found in previous papers [10], [11]. We assume that when $p$ and $p_j$ have a contact, the user of $p$ and the user of $p_j$ have a social contact. Henceforth, we use the terms “social contact” and “contact” interchangeably.

B. Jyotish Overview
Figure 1 shows the steps of the Jyotish framework to construct the predictive model from the joint Wifi/Bluetooth trace $D$. In the first and second steps, we cluster Wifi records in $W$ into locations (see Section III). Then, in steps 3 and 4, we construct a Naive Bayesian classifier to assign locations to records in the BT trace $B$ (see Section IV). In steps 5 and 6, the BT trace with assigned locations is used as the input to construct the location predictor, stay duration predictor, and contact predictor (see Section V). Henceforth, we use “stay duration” and “duration” interchangeably.

[Figure 1. Overview of Jyotish framework. Data sets: W (set of Wifi records), F (set of Wifi records with locations), B (set of BT records), C (set of BT records with locations). Functional components: clustering Wifi records into locations (steps 1 and 2), assigning locations for Bluetooth records (steps 3 and 4), and constructing the location predictor, stay duration predictor, and contact predictor (steps 5 and 6), which output the predictive model with three predictors.]

III. CLUSTERING WIFI RECORDS INTO LOCATIONS
This section presents an algorithm called “UIM Clustering” to cluster Wifi records into clusters. This section focuses on steps 1 and 2 in Figure 1.

A. UIM Clustering Algorithm Overview
There are several challenges in obtaining locations from the Wifi records of $W$. First, since the Wifi signal fluctuates, a phone that stays in one fixed position may still obtain different results for different Wifi scans. Previous work [16], [17] used the signal strength to cluster Wifi MACs into locations and suffered from Wifi signal fluctuation. Second, if the phone is between two adjacent buildings, the Wifi scan result might partially overlap with the scan results obtained when the phone stays inside either of the buildings.

Fortunately, the movement pattern of people is relatively regular since they tend to stay more frequently at their regular places. So, if two Wifi MACs $a_1, a_3$ appear together more frequently than two Wifi MACs $a_1, a_5$ in the Wifi trace $W$, then it is likely that the access points $a_1$ and $a_3$ are close to each other in a physical building. That means it is better to group $a_1$ and $a_3$ into the same location than $a_1$ and $a_5$. So, we exploit the regularity of people movement to cluster Wifi MACs into locations. Moreover, our approach provides a finer grain of location than that derived from cellular base stations [7] since the transmission range of Wifi access points is much shorter than that of cellular base stations.

[Figure 2. Execution of UIM Clustering algorithm: Wifi records in W → (step 1) good set of records ∆ ⊂ W → (step 2) similarity graph Gθ → (step 3) candidate cluster set CC → (step 4) final cluster set CF.]
Table III. Major notations used by UIM Clustering algorithm
Name             | Description
$\Delta$         | Set of good records of $W$, $\Delta \subset W$
$\gamma_{A_i}$   | The binary bit vector of $A_i$, $|\gamma_{A_i}| = |W_A|$
$\gamma$         | Set of binary vectors: $\gamma_{A_i} \in \gamma$ where $A_i \in \Delta$
$G_\theta$       | The similarity graph: $G_\theta = \langle V_\theta, E_\theta \rangle$
$C_C$            | Candidate Cluster Set obtained from $G_\theta$
$\gamma^S_{C_i}$ | Signature vector of cluster $C_i \in C_C$
$C_F$            | Final Cluster Set obtained from $C_C$
$\theta$         | The similarity threshold
In our algorithm, for each record (or tuple) $w_i = \langle t_i, A_i \rangle$, we do not use the scan time $t_i$ and only use $A_i$. Thus, in this section, we use $A_i$ to represent the record $w_i$. In other sections, we use $w_i$ to represent the $i$th record of $W$. We first define a location as a unique set of Wifi MACs that appear frequently together in the records of $W$. In Table I, the pair $a_1, a_3$ appears together twice while the pair $a_1, a_5$ appears together once. So, we say $a_1, a_3$ appear together more frequently in $W$ than $a_1, a_5$.

Figure 2 shows the execution block diagram of the UIM Clustering algorithm. In step 1, given the records in $W$, we obtain the subset of good records $\Delta \subset W$ (see Section III-B). In step 2, we measure the similarity between all pairs of records of $\Delta$ and construct a similarity graph $G_\theta$, in which each vertex is a record of $\Delta$. In step 3, we apply the Star Clustering algorithm [19] to cluster the vertexes into a set $C_C$ of candidate clusters. Finally, in step 4, candidate clusters are merged based on their similarity measures to obtain the set $C_F$ of final clusters. Each cluster in $C_F$ can be used to represent one location. Table III presents the major notations used by the UIM Clustering algorithm.

B. Obtaining the Good Set $\Delta$ of Wifi Records
This section focuses on Step 1 in Figure 2. First, we define a good record as a record that consists of Wifi MACs appearing frequently together in the records of $W$. We determine whether a record $A_i \in W$ is a good record as follows: for each pair of Wifi MACs $(a_j, a_k) \in A_i$, we calculate the support value $s_{j,k}$, which represents how frequently the pair $(a_j, a_k)$ appears together in the same records of $W$:
$$s_{j,k} = \frac{c(a_j, a_k)}{\min\{c(a_j), c(a_k)\}} \qquad (1)$$
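The support value of Equation 1 can be sketched in a few lines of Python, taking $c(a_j)$ as the number of records that contain $a_j$ and $c(a_j, a_k)$ as the number of records that contain both MACs. The function name `support_values` and the sample records are hypothetical and chosen only to make the contrast between a frequent and an infrequent pair visible.

```python
from collections import Counter
from itertools import combinations

def support_values(records):
    """Map each co-occurring pair (a_j, a_k) to its support value s_jk (Equation 1)."""
    single = Counter()  # c(a_j): number of records that contain a_j
    pair = Counter()    # c(a_j, a_k): number of records that contain both MACs
    for macs in records:
        macs = set(macs)
        single.update(macs)
        pair.update(combinations(sorted(macs), 2))
    return {p: n / min(single[p[0]], single[p[1]]) for p, n in pair.items()}

# Hypothetical Wifi records (sets of MACs), not taken from the UIM trace.
records = [{"a1", "a3"}, {"a1", "a3"}, {"a1", "a3", "a5"}, {"a5"}, {"a5"}]
s = support_values(records)
print(s[("a1", "a3")])  # a1 and a3 always appear together
print(s[("a1", "a5")])  # a1 and a5 appear together in only one record
```

Pairs are keyed in sorted order, so `("a1", "a3")` and `("a3", "a1")` refer to the same entry.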
In Equation 1, $c(a_j)$ is the number of records $A_i \in W$ in which $a_j \in A_i$, and $c(a_j, a_k)$ is the number of records $A_i \in W$ in which $a_j \in A_i$ and $a_k \in A_i$. Intuitively, $s_{j,k}$ is similar to the notion of the support value of a frequent item set in the data mining literature. The denominator of Equation 1 is the minimum of $c(a_j)$ and $c(a_k)$ since we are interested in the Wifi MAC that appears in fewer records and in its association with the other MAC of the pair. This minimum represents the coexistence of the two Wifi MACs in the records of $W$. We have $s_{j,k} \in [0, 1]$, and a greater value of $s_{j,k}$ means the two Wifi MACs appear together in the same records of $W$ more frequently.

Let $|A_i|$ be the number of Wifi MACs of the record $A_i$. For each $A_i \in W$, we have $\binom{|A_i|}{2}$ pairs of Wifi MACs and $\binom{|A_i|}{2}$ support values, which constitute a distribution. Let $\lambda_{A_i}$ and $\xi_{A_i}$ be the mean and standard deviation of this distribution. If $A_i$ has only one Wifi MAC, then $\lambda_{A_i} = 1$ and $\xi_{A_i} = 0$. Intuitively, we prefer a greater value of $\lambda_{A_i}$ since it means $A_i$ contains Wifi access points that often appear together in the records of $W$. We prefer a smaller value of $\xi_{A_i}$ since it means the support values stay in a small range. So, for $A_i$, we calculate the ratio $\xi_{A_i} / \lambda_{A_i}$ to (1) select good records whose Wifi MACs appear together frequently in the same records of $W$, and (2) remove bad records consisting of Wifi MACs that do not frequently appear together in records of $W$.

Let $F_W$ be the set of records in which each record $F_i \in F_W$ has the format $F_i = \langle \xi_{A_i} / \lambda_{A_i}, A_i \rangle$ with $A_i \in W$. We sort the records of $F_W$ in increasing order of their ratios $\xi_{A_i} / \lambda_{A_i}$ and create the set $\Delta$ of good records from $F_W$ as follows. Let $\Delta_A$ be the set of all Wifi MACs in the records of $\Delta$. We scan $F_W$ from the beginning, and a record $F_i \in F_W$ is added to $\Delta$ if adding the Wifi MACs of $F_i$ to $\Delta_A$ increases the size of $\Delta_A$. We stop adding records from $F_W$ to $\Delta$ when $|\Delta_A| = |W_A|$. Since the records added to $\Delta$ have small values of $\xi_{A_i} / \lambda_{A_i}$, we reduce the size of $\Delta$ and remove most of the noise data in $W$.

C. Constructing Similarity Graph $G_\theta$
This section focuses on Step 2 in Figure 2. Given the good set $\Delta$, we convert $A_i \in \Delta$ into a binary bit vector $\gamma_{A_i}$ as follows. If the Wifi MAC $a_j \in A_i$, then the $j$th bit of the vector $\gamma_{A_i}$ is set to 1, i.e., $\gamma_{A_i}[j] = 1$; otherwise, $\gamma_{A_i}[j] = 0$. Figure 3 shows an example of the binary bit vector. Notice that $|\gamma_{A_i}| = |W_A|$.

[Figure 3. Bit vector $\gamma_{A_i}$ with $A_i = \{a_1, a_2, a_4, a_{10}\}$: bits 1, 2, 4, and 10 of the $|W_A|$-bit vector are set to 1 and all other bits are 0.]

Let $\gamma$ be the set of binary vectors obtained from all records $A_i \in \Delta$. Then, we use the Tanimoto coefficient [20] (for binary vectors, the number of shared 1-bits divided by the number of positions where at least one vector has a 1-bit) to calculate the similarity measure $T_{p,q}$ between a pair of vectors $\gamma_p \in \gamma$, $\gamma_q \in \gamma$:
$$T_{p,q} = \frac{\gamma_p \cdot \gamma_q}{\|\gamma_p\|^2 + \|\gamma_q\|^2 - \gamma_p \cdot \gamma_q} \qquad (2)$$
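The good-record selection of Section III-B and the similarity of Equation 2 can be sketched as follows. Records are represented as Python sets of MACs rather than explicit bit vectors, which gives the same dot products and norms for binary vectors. The function names are hypothetical, and the use of the population standard deviation (`pstdev`) is an assumption, since the text does not say which estimator is used.

```python
from collections import Counter
from itertools import combinations
from statistics import mean, pstdev

def support_values(records):
    """Equation 1 for every pair of MACs that co-occurs in at least one record."""
    single, pair = Counter(), Counter()
    for macs in records:
        single.update(macs)
        pair.update(combinations(sorted(macs), 2))
    return {p: n / min(single[p[0]], single[p[1]]) for p, n in pair.items()}

def good_records(records):
    """Greedy choice of the good set: smallest xi/lambda first, stop once every MAC is covered."""
    s = support_values(records)

    def ratio(macs):
        values = [s[p] for p in combinations(sorted(macs), 2)]
        if not values:              # single-MAC record: lambda = 1, xi = 0
            return 0.0
        return pstdev(values) / mean(values)  # population std. dev. is an assumption

    all_macs = set().union(*records)          # W_A
    covered, good = set(), []                 # Delta_A and Delta
    for macs in sorted(records, key=ratio):
        if not macs <= covered:               # the record contributes at least one new MAC
            good.append(macs)
            covered |= macs
        if covered == all_macs:
            break
    return good

def tanimoto(p, q):
    """Equation 2 for two binary vectors represented as sets of MACs."""
    shared = len(p & q)                       # dot product of the two bit vectors
    return shared / (len(p) + len(q) - shared)

# Wifi MAC sets of Table I.
table_i = [{"a1", "a3"}, {"a1", "a5"}, {"a6"}, {"a4", "a7", "a9"}, {"a1", "a3"}]
delta = good_records(table_i)
print(len(delta), tanimoto({"a1", "a3"}, {"a1", "a5"}))
```

On Table I the repeated record {a1, a3} is selected only once, because its second occurrence adds no new MAC to the covered set.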
Next, we construct the similarity graph $G_\theta = \langle V_\theta, E_\theta \rangle$, in which each vector $\gamma_p \in \gamma$ is considered a vertex $v_p \in V_\theta$. For a pair of vertexes $v_p, v_q \in V_\theta$, the edge $(v_p, v_q)$ exists (i.e., $(v_p, v_q) \in E_\theta$) if $T_{p,q} \geq \theta$. $\theta$ is a threshold that determines the topology of $G_\theta$ and has an important impact on the clustering result (see Section III-F).

D. Obtaining Candidate Cluster Set $C_C$
This section focuses on Step 3 in Figure 2. Particularly, we apply the Star Clustering algorithm [19] to cluster the vertexes of $G_\theta$ into clusters since Star Clustering does not require a pre-defined number of clusters, unlike other algorithms such as k-means and hierarchical clustering. Star Clustering thus fits our context very well since we do not know in advance the number of locations in the Wifi trace $W$. Applying the Star Clustering algorithm, we first sort the vertexes in decreasing order of their node degrees. Then, we scan the sorted list of vertexes: for each vertex $v_p$, if $v_p$ is not in any cluster, $v_p$ becomes the center of a new cluster. For each neighbor $v_q$ of $v_p$, if $v_q$ does not belong to any cluster, $v_q$ is included in the cluster centered at $v_p$. The process continues until all the vertexes belong to clusters. We call this set of clusters the candidate cluster set $C_C$.

E. Obtaining Final Cluster Set $C_F$
This section focuses on Step 4 in Figure 2. A cluster $C_i \in C_C$ consists of a set of vertexes, each of which is a binary vector representing a record $w \in \Delta$. Let $\gamma^S_{C_i}$ be the signature vector of the cluster $C_i$. $\gamma^S_{C_i}$ is obtained by applying the bitwise OR operation over all the binary vectors of $C_i$. Intuitively, the signature vector $\gamma^S_{C_i}$ represents the set of Wifi MACs that belong to the cluster $C_i$. Thus, the signature vector $\gamma^S_{C_i}$ can be used to uniquely distinguish clusters in $C_C$. Then, we use the signature vectors to merge cluster $C_1 \in C_C$ into cluster $C_2 \in C_C$ if $C_1$ is a sub cluster of $C_2$. Formally, $C_1$ is merged into $C_2$ if $\gamma^S_{C_2} = (\gamma^S_{C_1} \text{ OR } \gamma^S_{C_2})$. So, we have the
final set of clusters $C_F$, in which each cluster $C_j \in C_F$ can be used to represent one particular location.

Given the final cluster set $C_F$, we classify all Wifi records $A_i \in W$ into clusters in $C_F$ as follows. Each record $A_i \in W$ is classified to the best matched cluster $C_i \in C_F$ based on the similarity measure between $\gamma_{A_i}$ and $\gamma^S_{C_i}$ calculated by Equation 2. The output of this step is the relation $F$ of all Wifi tuples in $W$ with assigned locations, as shown in Table IV. Formally, $F = \{w_i = \langle t_i, A_i, L_i \rangle : \langle t_i, A_i \rangle \in W\}$, where $L_i$ is the location assigned by the UIM Clustering algorithm to $w_i$.

Table IV. Example of relation F
Scan Time      | Wifi MACs  | Loc
03/08/10 09:20 | a1, a3     | L1
03/08/10 09:50 | a1, a5     | L1
03/08/10 10:20 | a6         | L5
03/08/10 13:50 | a4, a7, a9 | L8
03/14/10 08:20 | a1, a3     | L1

F. Setting Value of Similarity Threshold $\theta$
In this section, we empirically set the value of $\theta$ as follows. We first select 4 different participants and create for each of them a development set $W_D$, which consists of 64 Wifi records scanned in two different days. Then, we ask the participants to manually label the location for their Wifi records (e.g., Long’s home, Quang’s home, Klara’s office, etc.). For each value of $\theta \in [0.05, 0.9]$, we perform the following steps. For each pair of records $(A_1, A_2)$ in $W_D$, we check the cluster ids of $A_1$ and $A_2$ in $F$ and compare these cluster ids with the labeled locations in $W_D$. A location assignment made by the UIM Clustering algorithm is correct if (1) $A_1$ and $A_2$ have the same labeled location in $W_D$ and they are assigned to the same cluster in $F$, or (2) $A_1$ and $A_2$ have different labeled locations in $W_D$ and they are assigned to different clusters in $F$.

Figure 4 shows the percentage of correct classifications the clustering algorithm makes when $\theta$ varies from 0.05 to 0.5. The best value for all people is 0.1, at which the correct prediction for all 4 people is greater than 96%. When $\theta = 0.05$, clusters are merged into big clusters, i.e., nearby locations are merged into one location; this may produce “too big locations” and result in incorrect location assignment. In contrast, when $\theta$ increases, nearby clusters are separated. Thus, for a high value of $\theta$ (e.g., $0.1 < \theta \leq 0.9$), two records of the same location may be assigned to different clusters. So, we use $\theta = 0.1$ to evaluate the predictive model in Section VI.

[Figure 4. θ = 0.1 gives most correct locations. x-axis: similarity threshold (0.05 to 0.5); y-axis: correct location assignment (%); one curve per user (User 1 to User 4).]

IV. ASSIGNING LOCATIONS FOR BLUETOOTH RECORDS
Although tuples of $F$ are assigned locations, they do not provide the needed granularity since the Wifi scanner scans only every 30 minutes. During this period, the phone may move to different locations. Meanwhile, our BT scanner scans every minute. Our goal is to assign locations from tuples of $F$ to tuples of $B$ and thus obtain a finer granularity of people movement. The first step towards this goal is to map tuples of $F$ and tuples of $B$ using a time window $\alpha$. This section focuses on steps 3 and 4 in Figure 1. Table V presents the major notations used in the following sections.

Table V. Major notations used by the predictive model
Name          | Description
$\mu$         | Number of unique BT MACs collected by all phones
$F$           | A relation of Wifi tuples with assigned locations
$M$           | A relation of BT tuples created by $F$, $B$ and $\alpha$
$C$           | BT trace with assigned locations
$\alpha$      | The time window (in seconds)
$\beta_{min}$ | The threshold to assign the “Unknown” location
$\nu$         | The type of day, $\nu \in \{weekend, weekday\}$
$\tau$        | Time slot

A. Mapping between Wifi Records and BT Records Using Time Window $\alpha$
For a tuple $w_k = \langle t_k, A_k, L_k \rangle \in F$, we know that the person $p$ stays at the location $L_k$ at time $t_k$. We observe that during the time window $[t_k - \alpha, t_k + \alpha]$, if $\alpha$ is short enough, the person usually stays at the location $L_k$. Therefore, we can assign the location $L_k$ to all BT records $b_i = \langle t_i, U_i \rangle \in B$ in which $t_k - \alpha \leq t_i \leq t_k + \alpha$. Let $M$ be the relation of all BT tuples $b_i \in B$ that are assigned locations $L_k$ by using the time window $\alpha$ and the tuple $w_k = \langle t_k, A_k, L_k \rangle \in F$. Formally, $M = \{b'_i = \langle t_i, U_i, L_k \rangle : b_i = \langle t_i, U_i \rangle \in B,\ w_k = \langle t_k, A_k, L_k \rangle \in F,\ t_k - \alpha \leq t_i \leq t_k + \alpha,\ 1 \leq i \leq |B|,\ 1 \leq k \leq |F|\}$. Table VI shows an example of the relation $M$, which is created through the mapping between relation $B$ and relation $F$ using the time window $\alpha$. Section IV-C explains how to set the value of $\alpha$.

B. Assigning Locations for Bluetooth Records
We construct a Naive Bayesian classifier $N_B$ to predict the locations of all BT records in $B$. Basically, we use the
relation $M$ to train the Naive Bayesian classifier $N_B$ and then use $N_B$ to assign locations to all records $b_i \in B$.

Table VI. Example of relation M
Scan Time      | BT MACs    | Loc
03/08/10 09:20 | u1, u3     | L1
03/08/10 09:21 | u1, u3     | L1
03/08/10 13:50 | u4, u9     | L8
03/14/10 08:14 | u1, u3, u8 | L1

1) Training Naive Bayesian Classifier $N_B$: For a BT record $b_i \in B$, the probability that $b_i$ belongs to a location $L_k$ is calculated by using Bayes' theorem as follows:
$$P(L_k \mid b_i) = \frac{P(b_i \mid L_k) P(L_k)}{P(b_i)} \qquad (3)$$
Then, $b_i$ belongs to the location $L_{b_i}$ calculated as follows:
$$L_{b_i} = \arg\max_k \frac{P(b_i \mid L_k) P(L_k)}{P(b_i)} \qquad (4)$$
Since $P(b_i)$ is the same for all locations $L_k$, we calculate $f(L_k) = P(b_i \mid L_k) P(L_k)$. To calculate $P(b_i \mid L_k)$, we assume that for $u_1 \in b_i$³ and $u_2 \in b_i$, $u_1$ and $u_2$ are conditionally independent, i.e., $u_1$ and $u_2$ appear conditionally independently in the proximity of the experiment phone when they are scanned (and $b_i$ is created) by the BT scanner. This assumption usually holds in reality since people (with their Bluetooth-enabled devices) appear at locations independently. Let $f(L_k) = P(L_k) \prod_{u_j \in b_i} P(u_j \mid L_k)$; we have:
$$L_{b_i} = \arg\max_k f(L_k) \qquad (5)$$
The relation $M$ is used to calculate $f(L_k)$ in Equation 5 as follows. $P(L_k) = \frac{c(L_k)}{|M|}$, where $|M|$ is the size of $M$ and $c(L_k)$ is the number of tuples $b'_i = \langle t_i, U_i, L_i \rangle \in M$ in which $L_i = L_k$. For $P(u_j \mid L_k)$, we have:
$$P(u_j \mid L_k) = \frac{c(u_j)}{c(L_k)} \qquad (6)$$
In Equation 6, $c(u_j)$ is the number of records $b'_i = \langle t_i, U_i, L_i \rangle \in M$ in which $L_i = L_k$ and $u_j \in U_i$. Applying Equation 5 and Equation 6 for all records of $M$, we have the trained classifier $N_B$.

³ The correct notation should be $u_1 \in U_i$ and $b_i = \langle t_i, U_i \rangle$. However, to shorten the notation, we use $u_1 \in b_i$ in this section.

2) Applying Additive Smoothing Technique: In Section IV-A, since we only use a small time window $\alpha$ to create $M$, $M$ does not cover all BT MACs in $B$. Thus, when the trained classifier $N_B$ is applied to a record $b_i \in B$, the value $c(u_j)$ in Equation 6 might be 0 if $u_j$ does not belong to any tuple of $M$. A zero value of $c(u_j)$ then cancels out the values of $P(u_l \mid L_k)$ of the other BT MACs $u_l \in b_i$ (i.e., $j \neq l$) in Equation 5. To avoid this, we apply the Additive Smoothing technique [21] to Equation 6 as follows:
$$P(u_j \mid L_k) = \frac{c(u_j) + 1}{c(L_k) + \mu} \qquad (7)$$
In Equation 7, $\mu$ is the number of unique BT MACs collected by all participants for the entire experiment period. Adding $\mu$ to the denominator of Equation 7 means we take into account all possible BT MACs in calculating the probability of the BT MAC $u_j$. With Equation 7, $P(u_j \mid L_k) \neq 0$ for all $u_j$ and we have:
$$f(L_k) = P(L_k) \prod_{u_j \in b_i} \frac{c(u_j) + 1}{c(L_k) + \mu} \qquad (8)$$
So, we have a new trained classifier $N'_B$ by applying Equation 5 and Equation 8 for all tuples of $M$.

3) The “Unknown” Location: Applying $N'_B$ to assign locations to BT records $b_i \in B$, we encounter records $b_i$ whose value of $f(L_{b_i})$ calculated by Equation 4 is extremely small. These records $b_i$ are scanned between two consecutive Wifi scans (the period between two Wifi scans is 30 minutes) when the phone carrier moves to another location that is not captured by the Wifi scanner. Therefore, assigning any known location from the Wifi trace to $b_i$ results in a wrong assignment. To avoid this, we define a new location named “Unknown” and assign it to $b_i$. The next question is how small the value of $f(L_{b_i})$ must be for the record $b_i$ to be assigned the “Unknown” location. To answer this question, we use Equation 8 to calculate $f(L_{b'_i})$ for all records $b'_i \in M$. Let $\beta_{min} = \min_{b'_i \in M} f(L_{b'_i})$. We then use $\beta_{min}$ as the threshold value to assign the “Unknown” location to a BT record $b_i \in B$. The intuition is as follows. We assume that records in $M$ are “good records” whose locations are assigned correctly by the time window $\alpha$. So, the minimum value of $f(L_{b'_i})$ over all records $b'_i \in M$ represents the cutoff value for all records whose locations are assigned correctly. For a record $b_i \in B$, we have:
$$L_{b_i} = \begin{cases} \arg\max_k f(L_k) & \text{if } f(L_{b_i}) \geq \beta_{min} \\ \text{“Unknown”} & \text{otherwise} \end{cases} \qquad (9)$$
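Equations 5 to 9, together with the time-window mapping of Section IV-A, can be sketched as a small classifier. This is a minimal illustration under stated assumptions: scan times are plain seconds, `build_m` and `LocationNB` are hypothetical names, the value `mu=5` is made up for the example, and $\beta_{min}$ is taken as the smallest best score over the records of $M$, which is one reading of the definition in the text.

```python
from collections import Counter, defaultdict
from math import prod

def build_m(wifi_f, bt_b, alpha):
    """Relation M: BT records scanned within +/- alpha seconds of a located Wifi scan."""
    return [(t_i, macs, loc)
            for (t_i, macs) in bt_b
            for (t_k, loc) in wifi_f
            if abs(t_i - t_k) <= alpha]

class LocationNB:
    def __init__(self, m, mu):
        self.mu = mu                                       # number of unique BT MACs overall
        self.size = len(m)                                 # |M|
        self.loc_count = Counter(loc for _, _, loc in m)   # c(L_k)
        self.mac_count = defaultdict(Counter)              # c(u_j) within each location
        for _, macs, loc in m:
            self.mac_count[loc].update(set(macs))
        # beta_min: smallest best score among the training records (assumed reading)
        self.beta_min = min(self.best(macs)[1] for _, macs, _ in m)

    def f(self, macs, loc):
        """Equation 8: P(L_k) times the smoothed likelihood of every MAC in the record."""
        c_loc = self.loc_count[loc]
        likelihood = prod((self.mac_count[loc][u] + 1) / (c_loc + self.mu) for u in macs)
        return (c_loc / self.size) * likelihood

    def best(self, macs):
        """Equation 5: the (location, score) pair with the largest f."""
        return max(((loc, self.f(macs, loc)) for loc in self.loc_count),
                   key=lambda item: item[1])

    def assign(self, macs):
        """Equation 9: keep the best location only if its score reaches beta_min."""
        loc, score = self.best(macs)
        return loc if score >= self.beta_min else "Unknown"

# Hypothetical data: times are seconds since the first scan.
wifi_f = [(0, "L1"), (16200, "L8")]
bt_b = [(0, {"u1", "u3"}), (60, {"u1", "u3"}), (120, {"u1"}), (16200, {"u4", "u9"})]
m = build_m(wifi_f, bt_b, alpha=60)
nb = LocationNB(m, mu=5)
print(nb.assign({"u1"}), nb.assign({"u6", "u7", "u8"}))
```

Because each additional MAC multiplies the score by a factor below 1, long records of unseen MACs fall under the threshold quickly, which matches the conservative behaviour the text describes.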
Equation 9 means bi will be assigned Lbi location only if f (Lbi ) ≥ βmin . Otherwise, bi will be assigned the “Unknown” location. Although this approach seems to be conservative in assigning correct locations to BT records, it does provide good result in our evaluation of the predictive model in Section VI. Let C be the relation consisting of all records in B, which are assigned locations. Then, we sort C increasingly according to the scan times of its tuples and use C as the input to construct our predictors in Section V. C. Setting Value of Time Window α As we presented in Section IV-A, the value of α decides the mapping between Wifi records and BT records and the size of relation M , which is used to train the Naive Bayesian
100 Correct Location Assignement (%) 98 96 94 92 90 30 40 50 60 70 80 User 1 User 2 User 3 User 4 90 100 110 120 Time Window (s)
ν weekday weekday weekday weekday weekend weekday weekend
τ 08-10 08-10 08-10 12-14 08-10 08-10 14-16
Scan Time 03/08/10 09:20 03/08/10 09:21 03/08/10 09:22 03/08/10 13:50 03/14/10 08:12 03/15/10 09:47 03/20/10 15:23
Table VII
Loc L1 L1 L1 L8 L8 L1 L3
BT MACs u1 , u3 u1 , u3 u1 , u8 u4 , u9 u4 , u12 u1 u15
E XAMPLE OF BT TRACE WITH ASSIGNED LOCATION C
Figure 5.
Time window α = 60(s) gives most correct locations
classifier NB . In this section, we use the same technique in Section III-F to empirically set value for α. Particularly, we select 4 participants and for each of them, we create a set BD of BT records of two days and ask the participants to manually label locations for records in his BD . For two days, each BD has 960 records. For each pair of records (b1 , b2 ) ∈ BD , we check locations of b1 and b2 in C assigned by our Naive Bayesian classifier and compare these locations with the labeled locations in BD . Figure 5 shows that when α = 60(s), the relation C outputted by the Naive Bayesian classifier obtains the best location assignment, in which the correct prediction for all 4 people is greater than 95%. With α = 30(s), the relation M consists of too few records to train a good Naive Bayesian classifier. Meanwhile, α > 60(s) is too large a time window, which incurs noisy data in the relation M since BT records may be assigned wrong locations if they fall into this big time window. The trained classifier NB then performs worse than that with α = 60(s). So, we use α = 60(s) to evaluate the performance of our predictive model in Section VI. V. C ONSTRUCTING L OCATION P REDICTOR , D URATION P REDICTOR , AND C ONTACT P REDICTOR Given the relation C, we construct the location predictor, duration predictor, and contact predictor. This section focuses on step 5 and step 6 in Figure 1. To construct our predictors, we use two parameters: type of day and time slot. Let ν be the “type of day” and τ be the “time slot”. Particularly, we classify days into two types: weekend and weekday, so ν ∈ {weekday, weekend}, and divide time of a day into time slot of size 1, 2, 4, etc. hours. The motivation for the use of these two parameters is that people may visit different places and contact different people for the weekday and weekend. For each record r ∈ C, we map r’s scan time into type of day ν and time slot τ . 
Table VII shows an example of the relation C in which the tuples are mapped to a type of day and a time slot of size 2 hours. The relation C in this new format is used to construct our predictors.
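The mapping from a scan time to the pair (ν, τ) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, Saturday and Sunday are assumed to be the weekend days, and time slots are assumed to be indexed from midnight starting at 0.

```python
from datetime import datetime


def type_of_day(scan_time: datetime) -> str:
    """Map a scan time to the type of day (nu)."""
    # Assumption: Saturday and Sunday are the weekend days.
    return "weekend" if scan_time.weekday() >= 5 else "weekday"


def time_slot(scan_time: datetime, slot_hours: int = 2) -> int:
    """Map a scan time to a zero-based time-slot index (tau)."""
    # Assumption: slots start at midnight and slot_hours divides 24.
    return scan_time.hour // slot_hours


def to_query(scan_time: datetime, slot_hours: int = 2) -> tuple:
    """Build the pair (nu, tau) for one record's scan time."""
    return type_of_day(scan_time), time_slot(scan_time, slot_hours)


monday_morning = datetime(2010, 3, 15, 9, 30)
print(to_query(monday_morning))  # ('weekday', 4)
```

With 2-hour slots, a record scanned at 9:30 on a Monday falls into slot 4 (08:00 to 10:00) of a weekday.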
For a person p, the input query for p’s movement prediction is a record X of the form X = {ν1, τ1}, in which ν1 is the type of day and τ1 is the time slot. The output is the location at which p stays, the duration for which p stays at that location, and the contacts p has, for the type of day ν1 and during the time slot τ1.

A. Location Predictor

We use a Naive Bayesian classifier to predict the location of the person as follows:

$$L_X = \arg\max_{k} \{ P(\nu = \nu_1 \mid L_k) \, P(\tau = \tau_1 \mid L_k) \, P(L_k) \} \qquad (10)$$
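To make Equation 10 concrete, the sketch below estimates the three probabilities from counts and ranks the candidate labels by their product. It is only an illustration under stated assumptions: the function names `train` and `predict_top_k` are hypothetical, the probabilities are plain relative frequencies without any smoothing, and ties are broken alphabetically. The toy records are invented for the example and are not taken from the trace.

```python
from collections import Counter


def train(records):
    """Count occurrences from (day_type, slot, label) tuples."""
    label_counts = Counter()  # label -> count, for P(L_k)
    day_counts = Counter()    # (label, day_type) -> count, for P(nu | L_k)
    slot_counts = Counter()   # (label, slot) -> count, for P(tau | L_k)
    for day_type, slot, label in records:
        label_counts[label] += 1
        day_counts[(label, day_type)] += 1
        slot_counts[(label, slot)] += 1
    return label_counts, day_counts, slot_counts


def predict_top_k(model, day_type, slot, k=1):
    """Return the k labels with the largest Equation 10 score."""
    label_counts, day_counts, slot_counts = model
    total = sum(label_counts.values())
    scores = {}
    for label, n in label_counts.items():
        p_label = n / total                        # P(L_k)
        p_day = day_counts[(label, day_type)] / n  # P(nu = nu_1 | L_k)
        p_slot = slot_counts[(label, slot)] / n    # P(tau = tau_1 | L_k)
        scores[label] = p_day * p_slot * p_label
    # Assumption: ties are broken alphabetically by label.
    ranked = sorted(scores, key=lambda lab: (-scores[lab], str(lab)))
    return ranked[:k]


# Toy data, invented for illustration only.
records = (
    [("weekday", 4, "Office")] * 3
    + [("weekday", 4, "Lab")]
    + [("weekend", 4, "Home")] * 3
    + [("weekday", 9, "Home")]
)
model = train(records)
print(predict_top_k(model, "weekday", 4, k=2))  # ['Office', 'Lab']
```

Because the labels are opaque keys, the same counting scheme would apply unchanged if the labels were contacts rather than locations.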
In Equation 10, we assume that ν and τ are conditionally independent given the location Lk. Equation 10 outputs the most likely location LX for the input query X. Moreover, Equation 10 can easily be adapted to return the top-k most likely locations for the input query X. In this case, LX is the set of the top-k most likely locations, and we have a top-k location predictor.

B. Duration Predictor

The duration predictor is built on the location predictor. If the location predictor returns the top-k locations, the duration predictor returns the predicted stay duration for each of the k locations. We first define the “stay session at the location Lk” as the continuous time period during which the person stays at Lk. In our context, since the BT scanner obtains BT records every minute, the length in minutes of a stay session at the location Lk is the size of a relation Φ of consecutive tuples in the relation C such that, for any two consecutive tuples r1, r2 ∈ Φ, the scan times of r1 and r2 differ by exactly 1 minute. Let |Φ| denote the session length of one stay session at Lk. We first use the location predictor to obtain the location Lk for the input query X = {ν1, τ1}. Then, we create a sub-relation C′ = σ_{ν=ν1, τ=τ1, Loc=Lk}(C), where σ is the selection operator over the relation C [18]. Next, we calculate the session lengths for Lk from the relation C′ using the session definition above. Let Γk be the set of all stay session lengths for the location Lk obtained from C′, Γk = {|Φ1|, |Φ2|, |Φ3|, ..., |Φ_{|Γk|}|}; Γk forms a distribution of session lengths. Let λk and ξk denote the mean and standard deviation of this distribution. For example, the location L1 in Table VII has Γ1 = {3, 1}, where |Φ1| = 3 and |Φ2| = 1 (Φ1 consists of the first three records). The output of the duration predictor consists of λk and ξk for each location Lk.

C. Contact Predictor

To construct the contact predictor, we assume that each BT MAC scanned by the BT scanner is associated with a distinct person. As a result, each scanned BT MAC in a record of the BT trace represents a contact. We apply the Naive Bayesian classifier to find the most likely contact that the person p will have for the input X = {ν1, τ1} as follows:

$$U_X = \arg\max_{j} \{ P(\nu = \nu_1 \mid u_j) \, P(\tau = \tau_1 \mid u_j) \, P(u_j) \} \qquad (11)$$
Here, we assume that ν and τ are conditionally independent given the contact uj. Equation 11 outputs the most likely contact UX for the input query X. Equation 11 can easily be adapted to return the top-k most likely contacts for the input query X. In this case, UX is the set of the top-k most likely contacts, and we have a top-k contact predictor. The contact predictor predicts future contacts, which is crucial for the design of routing protocols and content distribution protocols in MANET and DTN.

VI. EVALUATION OF THE PREDICTIVE MODEL

A. Evaluation Settings

From March to August 2010, we collected 50 joint Wifi/Bluetooth traces from 50 experiment participants on the University of Illinois campus. Each trace spans 20 to 50 days. Let Di be the Wifi/Bluetooth trace of the i-th of the 50 participants: Di = Wi ∪ Bi, where Wi is the Wifi trace and Bi is the BT trace. For the i-th participant, we first apply the UIM Clustering Algorithm to Wi to obtain locations. Then, we apply the steps in Section IV to assign locations to the records in Bi. Let Ci be the resulting Bluetooth trace with assigned locations for the i-th participant. We divide the relation Ci into two disjoint subsets, the training set Ψi and the testing set Ωi, with Ψi ∩ Ωi = ∅. The training set Ψi contains 80% of the records in Ci, and Ωi contains 200 records picked at random from the set Ci \ Ψi. We use Ψi to train the three predictors (i.e., location, stay duration, and contact) and Ωi to evaluate them. Each record r ∈ Ωi is converted into the format X = {ν1, τ1} and used as the input to our predictors. We set θ = 0.1, α = 60(s), and the time slot to 2 hours in the following plots. We run the experiment 10 times (each time a new set Ωi is created at random) and plot the average for each experiment participant. More evaluation results can be found in our technical report [22].

B. Correctness of Predictors

1) Location predictor: Let L_p^i be the location predictor of the i-th experiment participant. For each record r ∈ Ωi, we use L_p^i to predict the location of r using the technique in
Section V-A. Let Lr be the location of r ∈ Ωi. Note that we only evaluate the correctness of L_p^i for records r whose Lr is not “Unknown”. Since the predictor L_p^i can output the top-k most likely locations, let Lpred be the set of predicted locations output by L_p^i, so |Lpred| = k. L_p^i makes a correct prediction if Lr ∈ Lpred. Figure 6(a) shows the correctness of L_p^i for the 50 users with k from 1 to 3. When k increases, the set Lpred has more elements, so the prediction is more likely to be correct, which this figure confirms. In particular, when k = 2, about 80% of nodes have more than 70% correct predictions. When k = 3, about 85% of nodes have more than 80% correct predictions. This shows that the location predictor provides an accurate location prediction.

2) Duration predictor: Let Λ_p^i be the duration predictor of the i-th experiment participant. Let λpred and ξpred be the mean and standard deviation returned by Λ_p^i for the input query X = {ν1, τ1}. Then, we use the definition in Section V-B to find the stay session that contains r in Ci. Note that r belongs to a unique session, since r has its own scan time and location. Let Λr be the length of the stay session that contains r in Ci. Since stay duration may vary significantly, predicting stay duration is challenging. Thus, we evaluate the correctness of Λ_p^i as follows: if λpred − ξpred ≤ Λr ≤ λpred + ξpred, then Λ_p^i makes a correct prediction. Here, we use the top-1 location predictor and consider only the cases in which its returned location is not “Unknown”. Figure 6(b) shows that the duration predictor performs considerably well. In particular, 80% of nodes obtain about 60% correct prediction and 40% of nodes have about 80% correct prediction. Since the stay duration of people at one location is difficult to predict, we believe this result confirms that the duration predictor can provide a relatively accurate duration prediction.
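The stay-session definition of Section V-B and the correctness test above can be sketched together as follows. This is a hedged illustration rather than the authors' code: the function names are hypothetical, scan times are assumed to be integer minutes for records already filtered to one location, type of day, and time slot (the sub-relation C′), and the population standard deviation is assumed because the text does not say which estimator is used.

```python
from statistics import mean, pstdev


def session_lengths(scan_minutes):
    """Return the lengths of runs of scan times that are 1 minute apart."""
    lengths = []
    run = 0
    prev = None
    # Assumption: duplicate scan times are collapsed before grouping.
    for t in sorted(set(scan_minutes)):
        if prev is not None and t - prev == 1:
            run += 1  # same stay session continues
        else:
            if run:
                lengths.append(run)  # close the previous session
            run = 1  # start a new session
        prev = t
    if run:
        lengths.append(run)
    return lengths


def duration_stats(lengths):
    """Return (lambda_k, xi_k): mean and standard deviation of the lengths."""
    # Assumption: population standard deviation.
    return mean(lengths), pstdev(lengths)


def is_correct(actual_length, lam, xi):
    """Check lambda - xi <= actual session length <= lambda + xi."""
    return lam - xi <= actual_length <= lam + xi


lengths = session_lengths([600, 601, 602, 610])
print(lengths)  # [3, 1]
```

The toy input reproduces the shape of the example Γ1 = {3, 1}: three consecutive one-minute scans form a session of length 3, and an isolated scan forms a session of length 1.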
3) Contact predictor: Let P_p^i be the contact predictor of the i-th experiment participant. For each record r ∈ Ωi, let Ppred be the set of top-k contacts returned by P_p^i, so |Ppred| = k. We evaluate P_p^i as follows. First, let Pr be the set of contacts appearing in Ci on the same type of day ν1 and during the same time slot τ1 as r. Second, the predictor P_p^i makes a correct prediction if Ppred ∩ Pr ≠ ∅. The intuition is that P_p^i predicts that, on a day of type ν1 and during the time slot τ1, the person p will have at least one of the contacts in Ppred. Figure 6(c) shows that P_p^i performs better as k increases from 1 to 7. With k = 7, about 80% of participants obtain more than 70% correct predictions and about 60% of participants obtain more than 80% correct predictions.

VII. CONCLUSION

The Jyotish framework provides an efficient solution for constructing a predictive model of people movement from a joint Wifi/Bluetooth trace. Evaluation over the real Wifi/Bluetooth trace collected by 50 participants shows that the constructed
Figure 6. Correctness of Location predictor, Stay duration predictor, and Contact predictor. (a) Location Prediction (k = 1, 2, 3); (b) Duration Prediction; (c) Contact Prediction (k = 1, 3, 5, 7). Each panel plots CDF - Percentage of Nodes (%) against the corresponding Prediction Accuracy (%).
predictive model predicts location, stay duration, and contact with considerably high accuracy. To the best of our knowledge, the Jyotish framework is the first to derive a predictive model from a real joint Wifi/Bluetooth trace. The constructed model is also the first to predict location, stay duration, and contact together. Since knowledge of future people movement is fundamental to numerous domains such as mobile wireless networks, HCI, social science, and environmental science, we believe the Jyotish framework and its constructed predictive model are widely applicable.

REFERENCES
[1] M. C. Gonzalez, C. A. Hidalgo, and A.-L. Barabasi, “Understanding individual human mobility patterns,” Nature, vol. 453, pp. 779–782, June 2008.
[2] G. Liu and G. Maguire, “A class of mobile motion prediction algorithms for wireless mobile computing and communications,” Mobile Networks and Applications, vol. 1, pp. 113–121, June 1996.
[3] W. Gao and G. Gao, “Fine-grained mobility characterization: Steady and transient state behaviors,” in Mobihoc, 2010.
[4] L. Song, D. Kotz, R. Jain, and X. He, “Evaluating location predictors with extensive wi-fi mobility data,” in Proceedings of Infocom, 2004.
[5] M. Sun and D. Blough, “Mobility prediction using future knowledge,” in Proceedings of MSWiM, 2007.
[6] P. N. Pathirana, A. V. Savkin, and S. Jha, “Mobility modelling and trajectory prediction for cellular networks with mobile base stations,” in Proceedings of MobiHoc, 2003.
[7] M. A. Bayir and M. Demirbas, “Mobility profiler: A framework for discovering mobility profiles of cell phone users,” Pervasive and Mobile Computing, 2010.
[8] L. McNamara, C. Mascolo, and L. Capra, “Media sharing based on colocation prediction in urban transport,” in Proceedings of Mobicom, 2008.
[9] J.-K. Lee and J. C. Hou, “Modeling steady-state and transient behaviors of user mobility: formulation, analysis, and application,” in Proceedings of Mobihoc, 2006.
[10] A. Chaintreau, P. Hui, J. Crowcroft, C. Diot, R. Gass, and J. Scott, “Impact of human mobility on the design of opportunistic forwarding algorithms,” in Infocom, 2006.
[11] P. Hui, A. Chaintreau, J. Scott, R. Gass, J. Crowcroft, and C. Diot, “Pocket switched networks and human mobility in conference environments,” in Proceedings of the ACM SIGCOMM workshop on Delay-tolerant networking, 2005.
[12] J. Leguay, A. Lindgren, J. Scott, T. Friedman, and J. Crowcroft, “Opportunistic content distribution in an urban setting,” in Proceedings of CHANTS, 2006.
[13] E. P. S. Gaito and G. P. Rossi, “Opportunistic forwarding in workplaces,” in Proceedings of ACM WOSN, 2009.
[14] L. Vu, K. Nahrstedt, S. Retika, and I. Gupta, “Joint bluetooth/wifi scanning framework for characterizing and leveraging people movement in university campus,” in Proceedings of MSWiM, 2010.
[15] “Skyhook. http://www.skyhookwireless.com.”
[16] J. Krumm and K. Hinckley, “The nearme wireless proximity server,” in Proceedings of Ubicomp, 2004.
[17] A. LaMarca, J. Hightower, I. Smith, and S. Consolvo, “Selfmapping in 802.11 location systems,” in Ubicomp, 2005.
[18] “Relational algebra. http://en.wikipedia.org/wiki/relational_algebra.”
[19] J. Aslam, K. Pelekhov, and D. Rus, “Static and dynamic information organization with star clusters,” in Proceedings of the Conference on Information Knowledge Management, 1998, pp. 208–217.
[20] “Tanimoto coefficient, http://en.wikipedia.org/wiki/cosine_similarity.”
[21] “Additive smoothing, http://en.wikipedia.org/wiki/additive_smoothing.”
[22] L. Vu, Q. Do, and K. Nahrstedt, “Exploiting joint wifi/bluetooth trace to predict people movement,” University of Illinois, Tech. Rep., 2010. [Online]. Available: http://hdl.handle.net/2142/16944