Support #785

Weighting for a complex sub-merged dataset

Added by Emily Lowthian over 2 years ago. Updated over 2 years ago.


Hi there,

I have read the section on weighting in the user guide 1-6 (pages 56-83); however, I am still slightly unsure which weights I can apply to this dataset.

I hope to fit a fixed/random effects longitudinal model to understand how traumatic events (i.e. parental fighting, and physical and verbal abuse towards children), as reported by the parents in both the hh and individual questionnaires, affect young people's physical and mental well-being (using the youth self-completion questionnaires); the parents' answers will be sub-merged onto the young people's cases. Once I have understood this relationship, I intend to examine how measures of affluence (material deprivation, socioeconomic group, income etc.) moderate it.

I'm slightly confused about which weighting to use, as I will be using hh, individual and youth data, taken from Waves 1, 2, 4 and 6 (the waves for which the data are available).

At the moment I would use, in your conventional format w_xxxyyzz_aa, the weight 1_ythscus_lw - however, there are no longitudinal weights for youth data.

Please could you advise me on the appropriate decision to take on this matter?


#1 Updated by Peter Lynn over 2 years ago

  • Target version set to X M
  • % Done changed from 0 to 10

Hi Emily,

I doubt that there exists a weight that is perfect for your analysis, but there may be a reasonable solution. To be able to advise on this, I need to understand better the analysis you wish to perform.

I'm unclear about your unit of analysis and the form of your model. Is the unit of analysis a child? So, effectively you are treating the parental information as attributes of the child?

And how will you use children who appear in some, but not all, of the youth data sets? (Very few will appear in all four waves - basically only those who were aged 10 at wave 1 - so I assume you are not restricting your analysis to those?) One way, for example, would be to use only one wave of youth data for each child, for example the most recent wave (corresponding to the oldest observed age for outcomes). Another way would be to use all observed waves for each child, but this will vary between 1 and 4 so you would need a model form that can deal with that structure.
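The two options described above can be sketched roughly as follows. This is only an illustration in Python with made-up records; the field names (`child_id`, `wave`, `wellbeing`) are assumptions for the example and are not Understanding Society variable names.

```python
# Hypothetical long-format youth records: one row per child per wave observed.
# Field names and values are invented for illustration only.
youth_records = [
    {"child_id": "A", "wave": 1, "wellbeing": 20},
    {"child_id": "A", "wave": 2, "wellbeing": 22},
    {"child_id": "A", "wave": 4, "wellbeing": 19},
    {"child_id": "A", "wave": 6, "wellbeing": 21},
    {"child_id": "B", "wave": 4, "wellbeing": 25},
    {"child_id": "B", "wave": 6, "wellbeing": 24},
    {"child_id": "C", "wave": 2, "wellbeing": 18},
]


def latest_wave_per_child(records):
    """Option 1: keep only the most recent observed wave for each child."""
    latest = {}
    for row in records:
        current = latest.get(row["child_id"])
        if current is None or row["wave"] > current["wave"]:
            latest[row["child_id"]] = row
    return latest


def waves_per_child(records):
    """Option 2 keeps every row; this counts how many waves each child has."""
    counts = {}
    for row in records:
        counts[row["child_id"]] = counts.get(row["child_id"], 0) + 1
    return counts


one_row_each = latest_wave_per_child(youth_records)
n_waves = waves_per_child(youth_records)
```

With the first option every child contributes exactly one row; with the second, the number of rows per child varies (here between 1 and 4), which is the unbalanced structure the model would need to handle.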



#2 Updated by Emily Lowthian over 2 years ago

Hi Peter,

Thank you for getting back to me so soon, much appreciated.

Ok - I will admit here that I am a 1+3 masters student, and longitudinal analysis is very new to me; I am yet to perform any analysis, I am currently just reading.

Yes, my unit of analysis will be the child, with the child's well-being as the outcome. You are correct: I will be sub-merging the parents' attributes onto the children and treating them as child responses, so to speak.

Your final question, about how I would use certain children, is not something I have thought about yet. I imagine the best way to do it, commonsensically, would be to take the measures from wave 6. I was unaware of the complexity that time-period-dependent variables bring.

I hope that answers your question a little better - if this is something that is better solved once my analysis has become clearer, I am happy to revisit it.


#3 Updated by Peter Lynn over 2 years ago

  • % Done changed from 10 to 40
  • Private changed from Yes to No

OK, so let's assume the unit of analysis is the child and your outcome measures will come from wave 6. This seems sensible, as the set of children participating at wave 6 should be representative of all 10-15-year-olds. (Outcomes are very likely to depend on the child's age, though, so you would certainly want to include this in your model, I imagine.)

In that case, the valid units for your analysis would be all wave 6 youth respondents for whom the relevant parent and household variables (the ones you need for your analysis) have also been observed. I would use the wave 6 cross-sectional youth weight, but if the proportion of wave 6 responding youths with missing parent/household variables is quite high/skewed, you might want to make an additional adjustment to the weight to account for this. This would involve modelling this missing data, with your dependent variable being a binary indicator of whether or not the parent/household variables are observed.
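As a rough illustration of the additional adjustment mentioned above, the sketch below uses a simple weighting-class approach: within each class, the weight of each case with observed parent/household variables is divided by the weighted proportion of cases in that class that are observed. This is a simpler stand-in for a fitted model of the binary "observed" indicator (for example a logistic regression), not a substitute for one. All names and values here (`xw_weight`, `parent_observed`, `age_band`) are hypothetical and are not actual variable names.

```python
from collections import defaultdict

# Hypothetical wave 6 youth respondents. "xw_weight" stands in for the wave 6
# cross-sectional youth weight, "parent_observed" flags whether the needed
# parent/household variables are present, and "age_band" is an assumed
# adjustment class. All values are invented.
respondents = [
    {"child_id": "A", "age_band": "10-12", "xw_weight": 1.0, "parent_observed": True},
    {"child_id": "B", "age_band": "10-12", "xw_weight": 1.0, "parent_observed": False},
    {"child_id": "C", "age_band": "13-15", "xw_weight": 2.0, "parent_observed": True},
    {"child_id": "D", "age_band": "13-15", "xw_weight": 1.0, "parent_observed": True},
    {"child_id": "E", "age_band": "13-15", "xw_weight": 1.0, "parent_observed": False},
]


def adjust_weights(rows, class_key="age_band"):
    """Return adjusted weights for cases with observed parent/household data."""
    total = defaultdict(float)     # weighted count of all cases per class
    observed = defaultdict(float)  # weighted count of observed cases per class
    for r in rows:
        total[r[class_key]] += r["xw_weight"]
        if r["parent_observed"]:
            observed[r[class_key]] += r["xw_weight"]

    adjusted = {}
    for r in rows:
        if r["parent_observed"]:
            # Estimated probability of being observed within this class.
            propensity = observed[r[class_key]] / total[r[class_key]]
            adjusted[r["child_id"]] = r["xw_weight"] / propensity
    return adjusted


adjusted_weights = adjust_weights(respondents)
```

The adjusted weights of the retained cases sum to the original weighted total, so the cases with complete data stand in for those dropped within each class. Which characteristics belong in the adjustment is a substantive choice that this sketch does not address.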

There is a question about how you define eligible parent/household observations. I imagine you would want them to be observed at a prior wave (this is usual practice in longitudinal analysis aiming to identify causal effects). So, you could say that you want them from wave 4. Or from wave 1 (if you want to look at effects over several years). Or you could take them from whichever wave they have been (most recently) observed. The extent of missing covariate data will depend on which of these options you think is the most appropriate for your research.
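To make the comparison between these options concrete, here is a small sketch with invented data showing how the amount of missing covariate data changes with the rule chosen. The structure `parent_reports` and the helper names are hypothetical, and `None` is used to mean "not observed at that wave".

```python
OUTCOME_WAVE = 6

# Hypothetical parent-reported measure for each child, keyed by wave.
# None means the measure was not observed at that wave. Values are invented.
parent_reports = {
    "A": {1: 0, 2: 1, 4: 1},
    "B": {1: 1, 2: None, 4: None},
    "C": {},
}


def from_fixed_wave(reports, wave):
    """Take the measure from one chosen prior wave (None if not observed)."""
    return reports.get(wave)


def most_recent_prior(reports, outcome_wave=OUTCOME_WAVE):
    """Take the measure from the latest prior wave at which it was observed."""
    prior = [w for w, v in reports.items() if w < outcome_wave and v is not None]
    if not prior:
        return None
    wave = max(prior)
    return wave, reports[wave]


# Count children with no usable parent measure under each rule.
missing_wave4 = sum(from_fixed_wave(r, 4) is None for r in parent_reports.values())
missing_wave1 = sum(from_fixed_wave(r, 1) is None for r in parent_reports.values())
missing_recent = sum(most_recent_prior(r) is None for r in parent_reports.values())
```

In this toy example, requiring wave 4 loses two of the three children, while accepting the most recent prior observation loses only one, at the cost of the measure referring to different time points for different children.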

I've gone a little beyond your question about which weight to use, but I hope this helps!


#4 Updated by Emily Lowthian over 2 years ago

I will be controlling for age, yes.

Ok, that makes sense - although the guide does specify that analyses using wave 1 data should only use the GPS and EMB weights - will that still be relevant (weight: a_ythscus_xw)? Or is it more to do with the weights and your outcome variables?
Also, should I be concerned about using cross-sectional weights in a longitudinal analysis? It sounds concerning in terms of the reliability of the data.

I would like to use waves 1, 2, 4 and 6 (I may exclude wave 1 due to it missing a few variables I want; I'm looking to work around it).

I too am slightly concerned about the level of missing data that I may see in this model, but your comments on this are very much appreciated and it is something I will take to my supervision in a couple of weeks' time.

Thank you for this, I really do appreciate it!



#5 Updated by Peter Lynn over 2 years ago

The choice of weights depends on the set of analysis units that you want to include in the analysis, i.e. the definition of "response" for your purpose (= relevant observations are present). Not really on what type of analysis you then want to do with those units.

I'm not sure what you mean by including waves 1, 2, 4 and 6. If you mean that the parent info must have been observed in all 4 waves, then this will restrict your analysis sample further, compared to accepting predictor variables from any wave.

Are you coming to our Understanding Society conference in July? If yes, maybe we could discuss this then?


#6 Updated by Emily Lowthian over 2 years ago

Ok, that makes sense.

I did want to include waves 1, 2, 4 and 6 to look at different time points of traumatic events - however, the more research I do, the less I think that will be possible; I think the best way to do this is to take the events at e.g. wave 4 and then observe the outcome at wave 6. My apologies for my lack of understanding.

I would like to go, however, I am an MSc student and my ESRC stipend does not cover research trips in my +1 year. I am also working out which day would be ideal - I imagine this would be Wednesday's (day 2) session. I would like to come.



#7 Updated by Victoria Nolan over 2 years ago

  • Status changed from New to Closed
