Selecting correct weights
I am a bit lost about the correct weights for my analysis. My project makes use of school codes, and I can see that in any given wave there are about 7-8K observations with a non-missing school code. However, if I use information from all people who have ever answered this question in any wave, the sample size grows to about 16-17K observations. So, I feel tempted to use a sample of all people who have ever answered this question, but cannot figure out which weights I could use in such an analysis. The documentation says that whenever a researcher uses information from multiple waves, he/she should use longitudinal weights. But my understanding, and some of the previous answers in this forum, made me think that longitudinal weights apply if I want to estimate changes from wave to wave. Could you please help me resolve this issue?
#1 Updated by Olena Kaminska 6 months ago
Thank you for your question. Could you kindly tell me what you are trying to study - schools or people, what age range, alive now or ever alive, etc.? In other words, if you tell me the population that you want to represent, I will be able to help you with the correct weight.
#2 Updated by Nurfatima Jandarova 6 months ago
My population of interest is people who finished school in the 80s, 90s and 00s. I am looking at their university education choices in response to labour market shocks, which are geography-dependent; hence, I am using their school codes to identify which geography they belonged to at the time of finishing school.
#4 Updated by Nurfatima Jandarova 6 months ago
Currently, my analysis is cross-sectional. As of now, it makes no difference to me whether it is wave 1 or wave 8, although wave 1 is more appealing because its sample excludes people coming from BHPS, so it is more random in a sense. The issue is that if I look at people present only in wave 1, only 8K people provided school codes, which isn't much considering that I would like to have some cohort- and geography-specific estimates.
The next stage of the project also looks at their earnings dynamics. At this point I would like to use multiple waves for the same people (at least two, but really as many as the data permit). In this case, I think it is quite clear that I should use longitudinal weights.
#5 Updated by Olena Kaminska 6 months ago
Yes, you are right - at the second stage you should use longitudinal weights. If, for example, you only use 2 waves per person, you could pool information from different wave pairs and then give each pair the longitudinal weight from the last wave of that wave combination.
For the first part of your project you can pool information from all waves and use cross-sectional weights for your analysis. So for wave 1 people you can use wave 1 xw weights, and so on. Make sure to control for clustering within PSUs. We advise that you add a scaling factor to your weights, especially if you use earlier waves of BHPS with UKHLS.
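A minimal sketch of this pooling setup, in Python with entirely illustrative record layouts and scaling factors (the actual UKHLS weight variable names and recommended scaling factors are in the documentation):

```python
# Pooling approach: each person contributes one record from one wave and
# carries that wave's cross-sectional (_xw) weight. An optional per-wave
# scaling factor can be applied when combining samples (e.g. BHPS + UKHLS).
# All names and numbers here are illustrative, not the real UKHLS variables.

def pooled_weight(record, scale=None):
    """Return the analysis weight for one pooled record.

    record: dict with 'wave' and that wave's cross-sectional weight 'xw',
            e.g. {'pid': 101, 'wave': 1, 'xw': 0.9}
    scale:  optional dict mapping wave -> scaling factor.
    """
    w = record['xw']
    if scale is not None:
        w *= scale.get(record['wave'], 1.0)
    return w

pooled = [
    {'pid': 101, 'wave': 1, 'xw': 0.9},   # interviewed in wave 1
    {'pid': 102, 'wave': 2, 'xw': 1.1},   # interviewed in wave 2
]
# Hypothetical scaling factor of 0.5 applied to wave 2 records.
weights = [pooled_weight(r, scale={2: 0.5}) for r in pooled]
```

In a real analysis you would also declare the PSU (and stratum) identifiers to your survey estimation routine, as advised above, rather than treating the weighted observations as independent.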
Finally, with weights both wave 1 and wave 2 are similarly representative of the population.
Hope this helps,
#8 Updated by Nurfatima Jandarova 3 months ago
Sorry to bother you again, but I am afraid I am not understanding longitudinal weights. My aim is to combine all waves into a single dataset and use it as panel data.
1. I can see that in the GPS sample some of the observations with longitudinal weight equal to 0 still receive non-zero cross-sectional weight. I find it confusing since my understanding was that cross-sectional weights are derived from longitudinal weights.
2. The documentation describes the derivation of combined longitudinal weights in wave 3 and mentions that a similar thing does not exist in wave 2. What should I do if I want to use all (BHPS and UKHLS) people in wave 2? Should I use b_psnenub_li? Can I maybe combine b_indscus_lw and b_indin01_lw into one variable?
#10 Updated by Olena Kaminska 3 months ago
1. Indeed, longitudinal weights have more zeros than cross-sectional ones - this is by design, as TSMs (temporary sample members) get positive xw weights and 0 longitudinal weights (this is related to their selection probability into our panel). Both types of weights are correct, but have different uses depending on your analysis.
2. If you pool cross-sectional data you could use the 'us' weight for wave 1 and the 'ub' weight for wave 2, both 'xw'. If you pool longitudinal 2-wave data, then for the wave 1/wave 2 combination you will have to use the 'us' weight, but you can use the 'ub' weight for the wave 2/wave 3 combination, both 'lw'.
3. BHPS people didn't participate in wave 1, so you can't easily combine them with UKHLS at wave 1. They joined UKHLS at wave 2, so from there on you can use them together, including in any longitudinal analysis that starts at wave 2 (or later).
4. But yes, technically you can combine the 'us' and '01' _lw weights. There are some assumptions that you would be making, but many other panels make them too. Your immigrants between 2001 and 2007 will be underrepresented, and the joint probabilities of being part of either panel will be wrong, but the results should not be too far off from the correct ones. As long as you are explicit about your assumptions, you can be fine. Note that the '01' weight covers a much smaller BHPS sample than the 'ub' weight.
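One naive way to combine the two weight variables, sketched in Python (names and the coalescing rule are my illustrative assumptions; a real application may additionally need to rescale the two parts relative to each other):

```python
# Combine two longitudinal weight variables into one: UKHLS (GPS + EMB)
# sample members carry the 'us' weight, BHPS members carry the '01' weight.
# A missing weight is represented here as None. Illustrative only.

def combined_lw(us_lw, lw01):
    """Coalesce: use the 'us' weight if present, otherwise the '01' weight."""
    if us_lw is not None:
        return us_lw
    if lw01 is not None:
        return lw01
    return 0.0  # neither weight applies: person drops out of the combined analysis
```

For example, a UKHLS member with `us_lw = 1.2` keeps that weight, while a BHPS member with only `lw01 = 0.8` gets 0.8.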
Hope this helps,
#11 Updated by Nurfatima Jandarova 2 months ago
Thank you for all your help so far!
Can I ask more questions? They are basically similar questions, but now that I have refined exactly what I need to do, I think they should be more specific.
So, let me explain the setup. I am pooling all individuals from all UKHLS waves together so as to maximise the number of observations with non-missing school codes. My main variables of interest are each person's highest degree when they first enter the labour market full time, and their earnings history. In my analysis I want to regress the earnings history on a degree dummy. There are two issues that make me frustrated with weights.
1. Suppose I see a certain 18-year-old throughout the waves. In the first wave the highest-degree information is not yet fixed, because the person is likely still in education. In wave 8 I see she got a degree and started working. So, I try to update the degree to take the latest possible information. If I update like this, then use only the last observation per person and either the first or last non-missing cross-sectional weight of the person, the share of people with a degree by year of birth looks nothing like the same statistic in any given wave (I tried to check against wave 9). If I still update this way, but use all waves and longitudinal weights, the picture is closer to the wave snapshot, but still not quite there. However, it sounds weird to use a longitudinal weight for a cross-sectional statistic. I am not interested in how degree information changes over time for people; I want to see how many people have got a degree in my sample.
2. For people born in the 1980s, Understanding Society will not show their earnings at the beginning of their career, because at the time of the first wave they had already been working for some time. In principle, I want to try to extend their earnings history according to a profile based on year of birth, age, education etc. In this case, I do not know what to do with weights. Should I assign the earliest weight to all those backward-filled observations, as if these people had provided their history in the first wave I observe them?
3. I think I now understand your message about weights for BHPS people better. Can you let me know if I understood correctly? Even though I can observe BHPS people in wave 2, I should still only use the 'us' weights, which assign 0 to them; starting from wave 3 these observations may have a non-zero weight.
4. Also, what do I do with longitudinal weights if I interpolate information between waves? For example, if I interpolate the earnings of BHPS people between BHPS wave 18 and UKHLS wave 2, what weights should the interpolated observations be assigned?
Sorry for the long message, hope it makes sense.
Thank you again for all your help!
#12 Updated by Olena Kaminska 2 months ago
See my answers below.
1. My understanding is that you need information on the highest degree for each person, and that you use the most recent information from other waves where possible. What is important for me is which set of people you end up with. So, for example, if I have information on everyone who responded in wave 1 (some of it from wave 1, some from later waves, but all treated the same) - what matters is that everyone who is in wave 1 is in my analysis - then you can use the cross-sectional weight for wave 1 and treat it as a simple one-wave analysis. If you want a higher sample size, you can use wave 2, which also includes BHPS, but then use the _xw weight for wave 2. If you structure the data this way you can use our weights.
I am not sure your pooling idea is correct. The best way to pool data from different waves is to think about the events you want to represent. Imagine you want to study the first job a person has - this can be seen as an event, and you can then pool all such event observations from across the waves.
Once you have a clear idea of how to construct your data correctly, answering the question about which weight to use will be very easy. I think your problem at the moment is the former one.
2. I am not an expert in this field, but we have some information on employment histories that may be helpful for you. The weight you use depends on how your data look in the end. So again, if you have employment information either from wave 1 or from an earlier time, put it together with the wave 1 questions and analyse it as wave 1 data, you can just use the wave 1 cross-sectional weight. Again, your problem is not the weight; you need to think about how best to represent the people or events you want to study.
3. I am not sure what you refer to here. If your analysis is cross-sectional at wave 2, use the 'ub' weight, which includes BHPS. But if you analyse waves 1 and 2 longitudinally, you will be limited to the GPS + EMB samples, so use the 'us' weight.
4. Do you mean by interpolating that you create new observations based on some earlier characteristics? This sounds like unit imputation. So, for example, if you impute all missing people (remember not to impute dead people) from 1991 onwards, and in the end you have full information on the full 1991 sample, you can use our cross-sectional 1991 weight, as you have already corrected for attrition through imputation. In general, again look at which people you have ended up with. Importantly, you want to have everyone with a non-zero weight at some point, minus the ineligible, and this is the weight you can use. I would also suggest that, if you use information from previous waves to impute or for your analysis, you drop TSMs and use longitudinal weights, because by design TSMs are not followed longitudinally.
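A quick sanity check in the spirit of the advice above can be automated: after building the analysis file, compare the people in it against everyone who carries a positive weight in the wave whose weight you plan to use. This is a sketch with hypothetical identifiers, not part of any UKHLS tooling:

```python
# Coverage check: which eligible people with a positive weight are missing
# from the analysis sample? A large gap suggests the chosen weight does not
# match the sample, so attrition is not being corrected for.

def coverage_gap(analysis_ids, weight_by_id, ineligible_ids=frozenset()):
    """Return ids with a positive weight that are neither in the analysis
    sample nor flagged as ineligible (e.g. deceased or moved abroad)."""
    weighted = {pid for pid, w in weight_by_id.items() if w > 0}
    return weighted - set(analysis_ids) - set(ineligible_ids)
```

For instance, if persons 1 and 2 are in the sample, person 3 is ineligible and person 4 has weight 0, `coverage_gap([1, 2], {1: 0.9, 2: 1.1, 3: 0.7, 4: 0.0}, {3})` is empty, i.e. the sample matches that weight's coverage.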
I hope this answers some of your questions.
#13 Updated by Nurfatima Jandarova 2 months ago
Thank you for a quick reply!
Ok, let me try to explain my setup again. I want a panel of people followed from the time they first finish full-time education and enter the labour market. So t = 0 would be the year when this event happens, and all subsequent periods record their earnings/employment history. At t = 0 each of these people has certain characteristics; importantly, some of them have a degree and others don't. Then, I want to look at the entire history of earnings within each group defined by this initial characteristic. In a graphical sense, the x-axis is t, the y-axis is earnings, and there are two earnings lines - one for people with a degree, one for people without.
To construct such a dataset, I have pooled all observations from all UKHLS waves together. Even if someone only popped up once in wave 4, she ends up in this pooled dataset. Then, I am trying to filter out individuals whose latest observed degree information comes only from a period when they had not yet finished full-time studies. I understand it would have been easier to concentrate on observations from a given wave only and then merge in their information from other waves, but unfortunately this results in far too small a sample size. Therefore, I pooled together all people who have ever shown up in any wave and tried to select the 'usable' ones, namely individuals who have completed the education phase of life.
I hope this clarifies it.
I am afraid I don't really understand the last sentence of your first answer. To me it sounds more or less like what I am doing...
Could you please also clarify your answers to points 2 and 4? Let me try to explain better. In my "dream" dataset I would see all people from the very point they first start working life (or at least being-able-to-work life, as I am also interested in unemployment probabilities as they progress over time) every year until they retire. Of course, UKHLS is not long enough for this, especially if I look at rather young people. So I just try to get the longest possible series for everyone. For some of the younger people I can indeed have the "dream" profile. Older people I would only observe in UKHLS having already been in the working phase for some time. For example, say a person born in 1982 started working at the age of 22. Then at least 5 years of earnings history is unavailable for that person at the time she enters UKHLS. Another case is when people had breaks between responses (for example, responding in waves 1 and 2, falling out in 3 and 4 and coming back in 5 and 6). For both of these cases I was thinking of imputing the missing years of earnings history from an estimated earnings profile.
So the imputation will be done for people already in my sample, just filling in information backwards. Therefore, I don't think that death is an issue.
I see the point about TSMs and will now drop them from the data generation code.
And really, thank you for patiently dealing with me! :)
#14 Updated by Olena Kaminska 2 months ago
Your detailed explanation really clarifies your data setup. So ignore some of my earlier comments, as some are no longer relevant. I have two important follow-up questions:
1. Does your analysis deal with truncated information? Survival analysis, for example, can use series of different lengths and is happy with truncated data. What I mean is: what happens to people who are observed for only 2 waves and then drop out? Are they still in your analysis? In that situation survival analysis deals with nonresponse due to attrition, so you don't need to correct for attrition through weighting - you would possibly use the longitudinal weight from the first wave in which you observe the person.
2. You mentioned imputing values for missing waves. This means that you are again already correcting for some of the nonresponse, and you do not need weighting to do this part. If you did not use survival analysis (which deals with truncated data) but instead imputed not only the missing in-between waves but also all following waves, up until wave 9 for example, you could again use the wave 1 weight, or the longitudinal weight from the first wave in your analysis.
Let me explain how the weight can work in your situation. In a way, you aim to represent events (or people experiencing events). Your starting point is conditional on a particular life situation (event). Let's say 3% of people in wave 2 experience this event, and 3% in each wave thereafter. Our wave 2 longitudinal weight represents these 3% correctly, and our wave 3 lw weight correctly represents the wave 3 people with this event. So you can pool them together using the longitudinal weights from the wave in which each person first starts, provided you deal with attrition yourself afterwards.
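The event-pooling idea above can be sketched as follows (a minimal illustration with made-up field names; 'lw' here stands for whichever longitudinal weight applies to the person's starting wave):

```python
# Each person enters the pooled event file at the wave where the event
# (e.g. finishing full-time education) is first observed, and carries the
# longitudinal weight from that starting wave. Attrition after t0 must then
# be handled by the analysis itself (survival methods, imputation, etc.).

def event_record(person):
    """person: dict with 'pid', 'event_wave', and 'lw' mapping wave -> weight.

    Returns the starting-wave record with its longitudinal weight, or None
    if the person has a zero weight at the event wave (e.g. a TSM, who by
    design is not followed longitudinally and should be dropped)."""
    w = person['lw'].get(person['event_wave'], 0.0)
    if w <= 0:
        return None
    return {'pid': person['pid'], 't0': person['event_wave'], 'weight': w}
```

So a person whose event is first seen in wave 2 keeps the wave 2 longitudinal weight for their whole earnings trajectory, and a TSM with a zero longitudinal weight is excluded.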
If your population of interest depends on being observed in later waves, though (for example, people are in your analysis only if they also find a nice job), then you have to use the longitudinal weight from the last wave that your definition draws on. This is because some of the nonrespondents could have qualified for your definition, and you want to represent them.
I hope this helps,